
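SQL Basics Notes

Notes on three Codecademy courses (Learn SQL, SQL: Table Transformation, and SQL: Analyzing Business Metrics), plus supplementary notes.
The Codecademy courses are written for SQLite; the statements in these notes have been converted to MySQL.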
 
I. Learn SQL
 
 
1. Manipulation - Create, edit, delete data
 
1.4 Create - create a database or a table in a database
 
CREATE TABLE celebs
    (
    id INTEGER,
    name TEXT,
    age INTEGER
    ); # column 1 is id (integer), column 2 is name (text), column 3 is age (integer)

 

1.5 Insert - insert a row into a table
 
INSERT INTO celebs (id, name, age)
    VALUES (1, 'Alan Mathison Turing', 42); # append a row to celebs: id is 1, name is Alan Mathison Turing, age is 42

 

1.6 Select - retrieve data
 
SELECT 
    *
FROM
    celebs; # show all data in the celebs table

 

1.7 Update - modify data
 
UPDATE celebs 
SET 
    age = 22
WHERE
    id = 1; # set age to 22 in the row of celebs where id = 1

 

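1.8 Alter - change a table's structure or column data types
 
ALTER TABLE celebs
ADD COLUMN twitter_handle TEXT; # add a twitter_handle column to the celebs table

 

ALTER TABLE test.data
CHANGE COLUMN Mobile Mobile BLOB NULL DEFAULT NULL; # change the data type of the Mobile column in test.data to BLOB, with NULL as the default value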

 

1.9 Delete - delete rows
 
DELETE FROM celebs 
WHERE
    twitter_handle IS NULL; # delete the rows of celebs where twitter_handle is NULL

 

 
2. Queries - Retrieve data
 
2.3 Select Distinct - return only distinct values
 
SELECT DISTINCT
    genre
FROM
    movies; # all distinct values of the genre column in movies

 

2.4 Where - specify the selection criteria
 
SELECT 
    *
FROM
    movies
WHERE
    imdb_rating > 8; # rows of movies where imdb_rating is greater than 8

 

= equals
!= not equals
> greater than
< less than
>= greater than or equal to
<= less than or equal to
 
2.5 Like I - search for a pattern in a column within a WHERE clause
 
SELECT 
    *
FROM
    movies
WHERE
    name LIKE 'Se_en';

 

2.6 Like II
 
SELECT 
    *
FROM
    movies
WHERE
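    name LIKE 'a%';

 

SELECT 
    *
FROM
    movies
WHERE
    name LIKE '%man%';

 

NB Wildcards
'_' substitutes any individual character
'%' matches zero or more missing characters
'[charlist]%' any individual character in the list: WHERE city LIKE '[ALN]%' matches cities starting with "A", "L" or "N"
'[!charlist]%' any individual character not in the list: WHERE city LIKE '[!ALN]%' matches cities not starting with "A", "L" or "N"
The bracket forms are SQL Server / Access syntax. MySQL's LIKE only supports '_' and '%'; in MySQL use REGEXP instead, e.g. WHERE city REGEXP '^[ALN]'.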
 
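2.7 Between - used in a WHERE clause to select values within a range
 
The BETWEEN operator is used to filter the result set within a certain range. The values can be numbers, text or dates.
SELECT 
    *
FROM
    movies
WHERE
    name BETWEEN 'A' AND 'J'; # rows of movies whose name sorts between 'A' and 'J'

 

NB: this returns names that begin with the letters "A" up to but not including "J", plus the exact name "J".
BETWEEN is inclusive at both ends in MySQL and in standard SQL. A longer name such as "Jaws" sorts after "J", which is why names beginning with J are excluded.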
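 
The string behaviour of BETWEEN can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so that the example is self-contained), and the movie names are made-up sample data.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT)")
conn.executemany(
    "INSERT INTO movies (name) VALUES (?)",
    [("Avatar",), ("Inception",), ("J",), ("Jaws",), ("Se7en",)],
)

# BETWEEN is inclusive: the exact value 'J' matches,
# but 'Jaws' sorts after 'J' and is filtered out.
between_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM movies WHERE name BETWEEN 'A' AND 'J' ORDER BY name"
    )
]
print(between_names)
```

To include every name starting with J, a common choice is an upper bound such as 'K' combined with name < 'K', or a LIKE pattern.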
 
SELECT 
    *
FROM
    movies
WHERE
    year BETWEEN 1990 AND 2000; # rows of movies where year is between 1990 and 2000

 

NB: years between 1990 up to and including 2000
 
2.8 And - logical AND operator
 
AND is an operator that combines two conditions. Both conditions must be true for the row to be included in the result set.
SELECT 
    *
FROM
    movies
WHERE
    year BETWEEN 1990 AND 2000
        AND genre = 'comedy'; # rows of movies where year is between 1990 and 2000 and genre is comedy

 

2.9 Or - logical OR operator
 
OR combines two or more conditions in a WHERE clause. Each condition is evaluated separately, and a row is included in the result set if any of the conditions is true.
SELECT 
    *
FROM
    movies
WHERE
    genre = 'comedy' OR year < 1980; # rows of movies where genre is comedy or year is earlier than 1980

 

2.10 Order By - sort the result set
 
SELECT 
    *
FROM
    movies
ORDER BY imdb_rating DESC; # rows of movies, sorted by imdb_rating in descending order

 

DESC sorts the result by a particular column in descending order (high to low or Z - A).
ASC ascending order (low to high or A - Z).
 
2.11 Limit - specify the number of rows to return
 
LIMIT is a clause that lets you specify the maximum number of rows the result set will have.
SELECT 
    *
FROM
    movies
ORDER BY imdb_rating ASC
LIMIT 3;  # rows of movies sorted by imdb_rating in ascending order, first 3 rows only

 

MS SQL Server uses SELECT TOP 3 instead; Oracle uses WHERE ROWNUM <= 3 (?)
 
 
3. Aggregate Function
 
3.2 Count - return the number of rows that match the given criteria
 
COUNT( ) is a function that takes the name of a column as an argument and counts the number of rows where the column is not NULL. COUNT(*) counts all rows.
SELECT 
    COUNT(*)
FROM
    fake_apps
WHERE 
    price = 0; # number of rows in fake_apps where price = 0

 

3.3 Group By - group rows for aggregate functions
 
SELECT 
    price, COUNT(*)
FROM
    fake_apps
WHERE
    downloads > 2000
GROUP BY price; # take the rows of fake_apps where downloads > 2000, group them by price, and return each price with its row count

 

Here, our aggregate function is COUNT( ) and we are passing price as an argument to GROUP BY. SQL will count the total number of apps for each price in the table.
It is usually helpful to SELECT the column you pass as an argument to GROUP BY. Here we SELECT price and COUNT(*).
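 
As a rough illustration of how WHERE, GROUP BY and COUNT(*) interact, the sketch below runs the same query shape against a few made-up rows. It assumes Python's sqlite3 module as a stand-in for MySQL; the column layout of fake_apps and the sample values are hypothetical.

```python
import sqlite3

# Hypothetical sample data in an in-memory SQLite database (stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fake_apps (name TEXT, price REAL, downloads INTEGER)")
conn.executemany(
    "INSERT INTO fake_apps VALUES (?, ?, ?)",
    [
        ("a", 0.0, 5000),
        ("b", 0.0, 3000),
        ("c", 0.99, 2500),
        ("d", 0.99, 100),  # removed by WHERE before any grouping happens
        ("e", 1.99, 9000),
    ],
)

# WHERE filters rows first, then GROUP BY forms one group per price,
# and COUNT(*) counts the rows left in each group.
counts = conn.execute(
    "SELECT price, COUNT(*) FROM fake_apps "
    "WHERE downloads > 2000 GROUP BY price ORDER BY price"
).fetchall()
print(counts)
```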
 
3.4 Sum - return the total of a numeric column
 
SUM is a function that takes the name of a column as an argument and returns the sum of all the values in that column.
SELECT 
    category, SUM(downloads)
FROM
    fake_apps
GROUP BY category;
 
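3.5 Max - return the largest value in a column (NULL values are ignored)
 
MAX( ) is a function that takes the name of a column as an argument and returns the largest value in that column.
SELECT 
    name, category, MAX(downloads)
FROM
    fake_apps
GROUP BY category; # NB: name is neither grouped nor aggregated, so MySQL 5.7+ rejects this under the default ONLY_FULL_GROUP_BY mode
 
3.6 Min - return the smallest value in a column (NULL values are ignored)
 
MIN( ) is a function that takes the name of a column as an argument and returns the smallest value in that column.
SELECT 
    name, category, MIN(downloads)
FROM
    fake_apps
GROUP BY category; # the same ONLY_FULL_GROUP_BY caveat applies to name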
 
3.7 Average - return the average of a numeric column (NULL values are ignored)
 
SELECT 
    price, AVG(downloads)
FROM
    fake_apps
GROUP BY price;

 

3.8 Round - round a numeric value to the given number of decimal places
 
ROUND( ) is a function that takes a column name and an integer as an argument. It rounds the values in the column to the number of decimal places specified by the integer.
SELECT 
    price, ROUND(AVG(downloads), 2)
FROM
    fake_apps
GROUP BY price;
 
 
4. Multiple Tables
 
4.2 Primary Key
 
A primary key serves as a unique identifier for each row or record in a given table. The primary key is literally an "id" value for a record. We could use this value to connect the table to other tables.
CREATE TABLE artists
    (
    id INTEGER PRIMARY KEY,
    name TEXT
    );
 
NB
By specifying that the "id" column is the "PRIMARY KEY", SQL makes sure that:
1. None of the values in this column are "NULL";
2. Each value in this column is unique.
 
A table cannot have more than one "PRIMARY KEY" column.
 
4.3 Foreign Key
 
SELECT 
    *
FROM
    albums
WHERE
    artist_id = 3;

 

A foreign key is a column that contains the primary key of another table in the database. We use foreign keys and primary keys to connect rows in two different tables. One table's foreign key holds the value of another table's primary key. Unlike primary keys, foreign keys do not need to be unique and can be NULL. Here, artist_id is a foreign key in the "albums" table.
 
The relationship between the "artists" table and the "albums" table is the "id" value of the artists.
 
4.4 Cross Join - produce the Cartesian product of two tables
 
SELECT 
    albums.name, albums.year, artists.name
FROM
    albums,
    artists;

 

One way to query multiple tables is to write a SELECT statement with multiple table names separated by a comma. This is also known as a "cross join".
 
When querying more than one table, column names need to be specified by table_name.column_name.
 
Unfortunately, the result of this cross join is not very useful. It combines every row of the "artists" table with every row of the "albums" table. It would be more useful to only combine the rows where the album was created by the artist.
 
4.5 Inner Join - returns rows when there is at least one match in both tables
 
SELECT 
    *
FROM
    albums
        JOIN
    artists ON albums.artist_id = artists.id; # INNER JOIN is equivalent to JOIN; a bare JOIN defaults to INNER JOIN

 

In SQL, joins are used to combine rows from two or more tables. The most common type of join in SQL is an inner join.
 
An inner join will combine rows from different tables if the join condition is true.
1. SELECT *: specifies the columns our result set will have. Here * refers to every column in both tables;
2. FROM albums: specifies first table we are querying;
3. JOIN artists ON: specifies the type of join as well as the second table;
4. albums.artist_id = artists.id: is the join condition that describes how the two tables are related to each other. Here, SQL uses the foreign key column "artist_id" in the "albums" table to match it with exactly one row in the "artists" table with the same value in the "id" column. It will only match one row in the "artists" table because "id" is the PRIMARY KEY of "artists".
 
4.6 Left Outer Join - returns all rows from the left table, even if there is no match in the right table
 
SELECT 
    *
FROM
    albums
        LEFT JOIN
    artists ON albums.artist_id = artists.id;

 

Outer joins also combine rows from two or more tables, but unlike inner joins, they do not require the join condition to be met. Instead, every row in the left table is returned in the result set, and if the join condition is not met, the NULL values are used to fill in the columns from the right table.
 
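RIGHT JOIN (right outer join): returns all rows from the right table, even if there is no match in the left table
FULL JOIN (full outer join): returns rows whenever there is a match in either table (not supported by MySQL)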
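 
The difference between an inner join and a left outer join shows up as soon as one album points at an artist that does not exist. The sketch below is a minimal illustration using Python's sqlite3 module in place of MySQL (an assumption for the sake of a runnable example); the table contents are invented.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums (id INTEGER PRIMARY KEY, name TEXT, artist_id INTEGER);
    INSERT INTO artists VALUES (1, 'Artist One');
    INSERT INTO albums VALUES (10, 'Matched Album', 1), (11, 'Orphan Album', 99);
    """
)

# Inner join: only albums whose artist_id finds a matching artists.id.
inner_rows = conn.execute(
    "SELECT albums.name, artists.name FROM albums "
    "JOIN artists ON albums.artist_id = artists.id ORDER BY albums.id"
).fetchall()

# Left join: every album is kept; unmatched artist columns come back as NULL (None).
left_rows = conn.execute(
    "SELECT albums.name, artists.name FROM albums "
    "LEFT JOIN artists ON albums.artist_id = artists.id ORDER BY albums.id"
).fetchall()

print(inner_rows)
print(left_rows)
```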
 
4.7 Aliases - give a column or table an alias
 
AS is a keyword in SQL that allows you to rename a column or table using an alias. The new name can be anything you want as long as you put it inside of single quotes.
SELECT 
    albums.name AS Album,
    albums.year,
    artists.name AS Artist
FROM
    albums
        JOIN
    artists ON albums.artist_id = artists.id
WHERE
    albums.year > 1980;
 
NB
The columns have not been renamed in either table. The aliases only appear in the result set.
 
 
 
II. SQL: Table Transformation
 
 
1. Subqueries
 
1.2 Non-Correlated Subqueries I
 
SELECT 
    *
FROM
    flights
WHERE
    origin IN (SELECT 
            code
        FROM
            airports
        WHERE
            elevation > 2000);

 

1.4 Non-Correlated Subqueries III
 
SELECT 
    a.dep_month,
    a.dep_day_of_week,
    AVG(a.flight_count) AS average_flights
FROM
    (SELECT 
        dep_month,
            dep_day_of_week,
            dep_date,
            COUNT(*) AS flight_count
    FROM
        flights
    GROUP BY 1 , 2 , 3) a
WHERE
    a.dep_day_of_week = 'Friday'
GROUP BY 1 , 2
ORDER BY 1 , 2; # average number of flights per Friday, for each month

 

Structure
[outer query]
    FROM
    [inner query] a
WHERE
GROUP BY
ORDER BY
 
NB
"a": With the inner query, we create a virtual table. In the outer query, we can refer to the inner query as "a".
"1,2,3" in inner query: refer to the first, second and third columns selected
Position of each column in the SELECT list, as used by GROUP BY 1, 2, 3:
SELECT dep_month,                # (1)
    dep_day_of_week,             # (2)
    dep_date,                    # (3)
    COUNT(*) AS flight_count     # (4)
FROM flights
 
SELECT 
    a.dep_month,
    a.dep_day_of_week,
    AVG(a.flight_distance) AS average_distance
FROM
    (SELECT 
        dep_month,
    dep_day_of_week,
    dep_date,
    SUM(distance) AS flight_distance
    FROM
        flights
    GROUP BY 1 , 2 , 3) a
GROUP BY 1 , 2
ORDER BY 1 , 2; # average flight distance for each day of the week (Monday through Sunday) in each month

 

1.5 Correlated Subqueries I
 
NB
In a correlated subquery, the subquery can not be run independently of the outer query. The order of operations is important in a correlated subquery:
1. A row is processed in the outer query;
2. Then, for that particular row in the outer query, the subquery is executed.
This means that for each row processed by the outer query, the subquery will also be processed for that row.
 
SELECT 
    id
FROM
    flights AS f
WHERE
    distance > (SELECT 
            AVG(distance)
        FROM
            flights
        WHERE
            carrier = f.carrier); # the list of all flights whose distance is above average for their carrier

 

1.6 Correlated Subqueries II
 
In the above query, the inner query has to be re-executed for each flight. Correlated subqueries are not limited to the WHERE clause; they can also appear in the SELECT list.
 
SELECT 
    carrier,
    id,
    (SELECT 
            COUNT(*)
        FROM
            flights f
        WHERE
            f.id < flights.id
                AND f.carrier = flights.carrier) + 1 AS flight_sequence_number
FROM
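    flights; # the result set contains carrier, flight id and a sequence number; within a carrier, a larger id gets a larger sequence number

 

In a correlated subquery, the subquery runs once for every row processed by the outer query. In this code, each time the outer query reads a row's carrier and id, the subquery counts how many rows in the table have the same carrier and a smaller id, then adds 1 to that count; the resulting column is aliased flight_sequence_number. So within a carrier, the larger a flight's id, the larger its sequence number.
If "<" is changed to ">", the larger the id, the smaller the sequence number.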
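 
The numbering produced by this correlated subquery can be traced on a handful of rows. The following sketch assumes Python's sqlite3 module as a stand-in for MySQL, and the five flights are made-up sample data.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; the flights are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (id INTEGER PRIMARY KEY, carrier TEXT);
    INSERT INTO flights VALUES (1, 'AA'), (2, 'DL'), (3, 'AA'), (4, 'AA'), (5, 'DL');
    """
)

# For each outer row, count the same-carrier flights with a smaller id, then add 1.
sequence = conn.execute(
    """
    SELECT carrier,
           id,
           (SELECT COUNT(*)
              FROM flights f
             WHERE f.id < flights.id
               AND f.carrier = flights.carrier) + 1 AS flight_sequence_number
      FROM flights
     ORDER BY carrier, id
    """
).fetchall()
print(sequence)
```

Flight 4 of carrier AA has two AA flights with smaller ids (1 and 3), so its sequence number is 3.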
 
 
2. Set Operation
 
2.2 Union (only distinct values)
 
Sometimes, we need to merge two tables together and then query the merged result.
There are two ways of doing this:
1) Combine columns from both tables side by side, called a join.
2) Stack the rows of both tables on top of each other, called a union.
 
SELECT 
    item_name
FROM
    legacy_products 
UNION SELECT 
    item_name
FROM
    new_products;

 

Each SELECT statement within the UNION must have the same number of columns with similar data types. The columns in each SELECT statement must be in the same order. By default, the UNION operator selects only distinct values.
 
2.3 Union All (allows duplicate values)
 
SELECT 
    AVG(sale_price) 
FROM
    (SELECT 
        id, sale_price
    FROM
        order_items UNION ALL SELECT 
        id, sale_price
    FROM
        order_items_historic) AS a;

 

2.4 Intersect
 
Microsoft SQL Server's INTERSECT returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand.
 
SELECT category FROM new_products
INTERSECT
SELECT category FROM legacy_products;

 

NB
MySQL does not support INTERSECT (it was only added in MySQL 8.0.31), but the same result can be obtained with INNER JOIN + DISTINCT, WHERE ... IN + DISTINCT, or WHERE EXISTS:
 
SELECT DISTINCT
    category
FROM
    new_products
        INNER JOIN
    legacy_products USING (category);

SELECT DISTINCT
    category
FROM
    new_products
WHERE
    category IN (SELECT 
            category
        FROM
            legacy_products);

 

http://stackoverflow.com/questions/2621382/alternative-to-intersect-in-mysql
 
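Many solutions found online that rely on UNION ALL (such as the one below) are wrong: they can return a value that appears in only one of the tables but has COUNT(*) > 1 there: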
 
SELECT 
    category, COUNT(*)
FROM
    (SELECT 
        category
    FROM
        new_products UNION ALL SELECT 
        category
    FROM
        legacy_products) a
GROUP BY category
HAVING COUNT(*) > 1;
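 
The false positive is easy to reproduce. The sketch below compares the UNION ALL approach with the INNER JOIN + DISTINCT approach on a tiny data set; it uses Python's sqlite3 module as a stand-in for MySQL, and the category values are invented for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; categories are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE new_products (category TEXT);
    CREATE TABLE legacy_products (category TEXT);
    INSERT INTO new_products VALUES ('toys'), ('toys'), ('games');
    INSERT INTO legacy_products VALUES ('games'), ('books');
    """
)

# Flawed emulation: 'toys' occurs twice in new_products alone, so COUNT(*) > 1.
union_all_result = sorted(
    row[0]
    for row in conn.execute(
        """
        SELECT category
          FROM (SELECT category FROM new_products
                UNION ALL
                SELECT category FROM legacy_products) a
         GROUP BY category
        HAVING COUNT(*) > 1
        """
    )
)

# Join-based emulation: only categories present in both tables.
join_result = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT category FROM new_products "
        "INNER JOIN legacy_products USING (category)"
    )
)

print(union_all_result)
print(join_result)
```

Only 'games' is in both tables, yet the UNION ALL version also reports 'toys'.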

 

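2.5 Except (MS SQL Server) / Minus (Oracle) - set difference
 
SELECT category FROM legacy_products
EXCEPT # MINUS in Oracle
SELECT category FROM new_products;

 

NB
MySQL does not support set difference (EXCEPT was only added in MySQL 8.0.31), but it can be emulated with LEFT JOIN ... WHERE ... IS NULL + DISTINCT, WHERE ... NOT IN + DISTINCT, or WHERE NOT EXISTS: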
 
SELECT DISTINCT
    category
FROM
    legacy_products
        LEFT JOIN
    new_products USING (category)
WHERE
    new_products.category IS NULL;

SELECT DISTINCT
    category
FROM
    legacy_products
WHERE
    category NOT IN (SELECT 
            category
        FROM
            new_products);

 

 
3. Conditional Aggregates
 
3.2 NULL
 
Use IS NULL or IS NOT NULL in the WHERE clause to test whether a value is or is not null.
 
SELECT 
    COUNT(*)
FROM
    flights
WHERE
    arr_time IS NOT NULL
        AND destination = 'ATL';

 

3.3 CASE WHEN "if, then, else"
 
SELECT
    CASE
        WHEN elevation < 250 THEN 'Low'
        WHEN elevation BETWEEN 250 AND 1749 THEN 'Medium'
        WHEN elevation >= 1750 THEN 'High'
        ELSE 'Unknown'
    END AS elevation_tier
    , COUNT(*)
FROM airports
GROUP BY 1;

 

END is required to terminate the statement, but ELSE is optional. If ELSE is not included, the result for rows that match no WHEN branch will be NULL.
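 
A short sketch of what happens when ELSE is left out: a row that matches none of the WHEN branches (here, one with a NULL elevation) gets NULL as its tier. The example assumes Python's sqlite3 module as a stand-in for MySQL, with hypothetical airport codes.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; airports are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE airports (code TEXT, elevation INTEGER);
    INSERT INTO airports VALUES ('AAA', 100), ('BBB', 900), ('CCC', 3000), ('DDD', NULL);
    """
)

# No ELSE branch: a NULL elevation fails every comparison, so the CASE yields NULL.
tiers = conn.execute(
    """
    SELECT code,
           CASE
               WHEN elevation < 250 THEN 'Low'
               WHEN elevation BETWEEN 250 AND 1749 THEN 'Medium'
               WHEN elevation >= 1750 THEN 'High'
           END AS elevation_tier
      FROM airports
     ORDER BY code
    """
).fetchall()
print(tiers)
```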
 
3.4 COUNT(CASE WHEN)
 
Count the number of low elevation airports by state, where low elevation is defined as less than 1000 ft.
SELECT 
    state,
    COUNT(CASE
        WHEN elevation < 1000 THEN 1
        ELSE NULL
    END) AS count_low_elevation_airports
FROM
    airports
GROUP BY state; 

 

3.5 SUM(CASE WHEN)
 
Sum the total flight distance and compare that to the sum of flight distance from a particular airline (in this case, Delta) by origin airport.
SELECT 
    origin,
    SUM(distance) AS total_flight_distance,
    SUM(CASE
        WHEN carrier = 'DL' THEN distance
        ELSE 0
    END) AS total_delta_flight_distance
FROM
    flights
GROUP BY origin; 

 

3.6 Combining aggregates
 
Find out the percentage of flight distance that is from Delta by origin airport.
SELECT 
    origin,
    100.0 * (SUM(CASE
        WHEN carrier = 'DL' THEN distance
        ELSE 0
    END) / SUM(distance)) AS percentage_flight_distance_from_delta
FROM
    flights
GROUP BY origin; 

 

3.7 Combining aggregates II
 
Find the percentage of high elevation airports (elevation >= 2000) by state from the airports table.
SELECT 
    state,
    100.0 * COUNT(CASE
        WHEN elevation >= 2000 THEN 1
        ELSE NULL
    END) / COUNT(elevation) AS percentage_high_elevation_airports
FROM
    airports
GROUP BY 1;

SELECT 
    state,
    100.0 * SUM(CASE
        WHEN elevation >= 2000 THEN 1
        ELSE 0
    END) / COUNT(elevation) AS percentage_high_elevation_airports
FROM
    airports
GROUP BY 1;

 

 
4. Date, Number and String Functions
 
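MySQL date functions:
https://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html
 
NOW()    returns the current date and time
CURDATE()    returns the current date
CURTIME()    returns the current time
DATE()    extracts the date part of a date or date/time expression
EXTRACT()    returns a single part of a date/time, such as year, month, day, hour or minute
DATE_ADD()    adds a time interval to a date
DATE_SUB()    subtracts a time interval from a date
DATEDIFF()    returns the number of days between two dates
DATE_FORMAT()    displays a date/time in a different format
 
Example 1:
 
SELECT NOW(), CURDATE(), CURTIME();
Result:
NOW()                    CURDATE()     CURTIME()
2008-12-29 16:25:46      2008-12-29    16:25:46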
 
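Example 2:
 
CREATE TABLE Orders (
    OrderId INT NOT NULL,
    ProductName VARCHAR(50) NOT NULL,
    OrderDate DATETIME NOT NULL DEFAULT NOW(),
    PRIMARY KEY (OrderId)
);

 

The OrderDate column uses NOW() as its default value. As a result, when a row is inserted into the table, the current date and time are filled into the column automatically.
 
Example 3:
 
EXTRACT(unit FROM date): date is a valid date expression; unit can be any of the values listed below.
DATE_ADD(date, INTERVAL expr type): date is a valid date expression, expr is the time interval to add, and type can be any of the values listed below.
DATE_SUB(date, INTERVAL expr type): date is a valid date expression, expr is the time interval to subtract, and type can be any of the values listed below.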
 
MICROSECOND
SECOND
MINUTE
HOUR
DAY
WEEK
MONTH
QUARTER
YEAR
SECOND_MICROSECOND
MINUTE_MICROSECOND
MINUTE_SECOND
HOUR_MICROSECOND
HOUR_SECOND
HOUR_MINUTE
DAY_MICROSECOND
DAY_SECOND
DAY_MINUTE
DAY_HOUR
 
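Example 4:
 
DATEDIFF(date1, date2): date1 and date2 are valid date or date/time expressions. Only the date parts of the values are used in the calculation.
 
Example 5:
 
DATE_FORMAT(date, format): date is a valid date; format specifies the output format of the date/time.
Available format specifiers: http://www.w3school.com.cn/sql/func_date_format.asp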
 
SELECT 
    id,
    carrier,
    origin,
    destination,
    DATE_FORMAT(NOW(), '%Y-%c-%d %T') AS datetime
FROM
    flights;

 

4.2 Dates
 
Select the date and time of all deliveries in the baked_goods table using the column delivery_time.
SELECT 
    DATE(delivery_time), TIME(delivery_time)
FROM
    baked_goods;

 

4.4 Dates III
 
Each of the baked goods is packaged five hours, twenty minutes, and two days after the delivery. Create a query returning all the packaging times for the goods in the baked_goods table.
SELECT 
    DATE_ADD(delivery_time,
        INTERVAL '2 5:20:00' DAY_SECOND) AS package_time
FROM
    baked_goods; 

 

DATE:
http://www.cnblogs.com/wenzichiqingwa/archive/2013/03/05/2944485.html
 
4.6 Numbers II
 
GREATEST(n1,n2,n3,...): returns the greatest value in the set of the input numeric expressions;
LEAST(n1,n2,n3,...): returns the least value in the set of the input numeric expressions;
 
Find the greatest time value for each item.
SELECT 
    id, GREATEST(cook_time, cool_down_time)
FROM
    baked_goods;

 

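NB: comparison rules differ by data type
 
The result set of the query above was (screenshot not preserved).
 
The actual cook_time and cool_down_time values in baked_goods were (screenshot not preserved).
 
In rows 2, 13 and 15 we clearly have 33 > 5, 20 > 8 and 45 > 8, yet GREATEST returned the smaller number.
This is because the cook_time and cool_down_time columns are of type TEXT (screenshot not preserved).
 
In GREATEST and LEAST:
when the arguments are text types such as TEXT, they are compared as strings, character by character starting from the first character. '5' > '3', so '5' > '33'; 'T' > 'R', so 'TORRES' > 'RENE';
only when the arguments are numeric types such as INT or BIGINT are they compared by numeric value.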
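 
The same text-versus-number effect can be reproduced in a few lines. SQLite has no GREATEST, so this sketch uses SQLite's multi-argument max() through Python's sqlite3 module as a stand-in; that substitution is an assumption made for illustration, although the string comparison rule it demonstrates is the one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Multi-argument max() plays the role of MySQL's GREATEST here.
# Text arguments are compared character by character; numbers by value.
as_text, as_number = conn.execute(
    "SELECT max('33', '5'), max(33, 5)"
).fetchone()

# Casting the text to integers restores the numeric comparison.
(after_cast,) = conn.execute(
    "SELECT max(CAST('33' AS INTEGER), CAST('5' AS INTEGER))"
).fetchone()

print(as_text, as_number, after_cast)
```

In MySQL the equivalent repair would be to cast the TEXT columns before comparing them, or to store the values in a numeric column in the first place.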
 
https://discuss.codecademy.com/t/6-numbers-ii-bugged/74569
 
4.7 Strings
 
CONCAT(concatenate):
 
Combine the first_name and last_name columns from the bakeries table as the full_name.
SELECT 
    CONCAT(first_name, ' ', last_name) AS full_name
FROM
    bakeries;

 

GROUP_CONCAT:
 
Combine the cities of the three states in the city column of the bakeries table as cities.
SELECT 
    state, GROUP_CONCAT(DISTINCT (city)) AS cities
FROM
    bakeries
WHERE
    state IN ('California', 'New York', 'Texas')
GROUP BY state;

 

http://www.cnblogs.com/appleat/archive/2012/09/03/2669033.html
 
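NB
CONCAT returns the string that results from concatenating its arguments.
If any argument is NULL, the result is NULL;
if all arguments are nonbinary strings, the result is a nonbinary string;
if any argument is a binary string, the result is a binary string;
a numeric argument is converted to its equivalent binary string form (this is the behaviour documented for older MySQL versions); an explicit CAST avoids this, for example:
SELECT CONCAT(CAST(int_col AS CHAR), char_col)
 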
CAST:
 
Example 1:
 
SELECT CAST(12 AS DECIMAL) / 3;
 
Returns the result as a DECIMAL by casting one of the values as a decimal rather than an integer.
 
Example 2:
 
SELECT 
    CONCAT(CAST(distance AS CHAR), ' ', city)
FROM
    bakeries;

 

4.8 Strings II
 
REPLACE(string, from_string, to_string);
The function returns the string 'string' with all occurrences of the string 'from_string' replaced by the string 'to_string'.
 
Replace 'enriched_flour' in the ingredients list with just 'flour'.
SELECT 
    id,
    REPLACE(ingredients,
        'enriched_flour',
        'flour')
FROM
    baked_goods;

 

 
 
III. SQL: Analyzing Business Metrics
 
 
1. Advanced Aggregates
 
1.4 Daily Revenue
 
How much we're making per day for kale-smoothies.
SELECT 
    DATE(ordered_at), ROUND(SUM(amount_paid), 2)
FROM
    orders
        JOIN
    order_items ON orders.id = order_items.order_id
WHERE
    name = 'kale-smoothie'
GROUP BY 1
ORDER BY 1;

 

1.6 Meal Sums
 
Total revenue of each item.
SELECT 
    name, ROUND(SUM(amount_paid), 2)
FROM
    order_items
GROUP BY name
ORDER BY 2 DESC;

 

1.7 Product Sum 2
 
Percent of revenue each product represents.
SELECT 
    name,
    ROUND(SUM(amount_paid) / (SELECT 
                    SUM(amount_paid)
                FROM
                    order_items) * 100.0,
            2) AS PCT
FROM
    order_items
GROUP BY 1
ORDER BY 2 DESC;

 

Subqueries can be used to perform complicated calculations and create filtered or aggregate tables on the fly.
 
1.9 Grouping with Case Statements
 
Group the order items by what type of food they are.
SELECT 
    *,
    CASE name
        WHEN 'kale-smoothie' THEN 'smoothie'
        WHEN 'banana-smoothie' THEN 'smoothie'
        WHEN 'orange-juice' THEN 'drink'
        WHEN 'soda' THEN 'drink'
        WHEN 'blt' THEN 'sandwich'
        WHEN 'grilled-cheese' THEN 'sandwich'
        WHEN 'tikka-masala' THEN 'dinner'
        WHEN 'chicken-parm' THEN 'dinner'
        ELSE 'other'
    END AS category
FROM
    order_items
ORDER BY id;

 

Look at percents of purchase by category:
SELECT 
    CASE name
        WHEN 'kale-smoothie' THEN 'smoothie'
        WHEN 'banana-smoothie' THEN 'smoothie'
        WHEN 'orange-juice' THEN 'drink'
        WHEN 'soda' THEN 'drink'
        WHEN 'blt' THEN 'sandwich'
        WHEN 'grilled-cheese' THEN 'sandwich'
        WHEN 'tikka-masala' THEN 'dinner'
        WHEN 'chicken-parm' THEN 'dinner'
        ELSE 'other'
    END AS category,
    ROUND(1.0 * SUM(amount_paid) / (SELECT 
                    SUM(amount_paid)
                FROM
                    order_items) * 100,
            2) AS PCT
FROM
    order_items
GROUP BY 1
ORDER BY 2 DESC;

 

NB
Here 1.0 * is a shortcut to ensure the database represents the percent as a decimal.
 
1.11 Reorder Rates
 
We'll define reorder rate as the ratio of the total number of orders to the number of people making those orders. A ratio close to 1 means most of the orders are first purchases. A higher ratio means more of the orders are reorders.
 
SELECT 
    name,
    ROUND(1.0 * COUNT(DISTINCT order_id) / COUNT(DISTINCT delivered_to),
            2) AS reorder_rate
FROM
    order_items
        JOIN
    orders ON orders.id = order_items.order_id
GROUP BY 1
ORDER BY 2 DESC;
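 
To see how the ratio behaves, the sketch below runs the reorder-rate query on invented data: one customer ordering soda three times gives 3.0, while two customers ordering a blt once each give 1.0. It assumes Python's sqlite3 module as a stand-in for MySQL and a simplified, hypothetical layout for the orders and order_items tables.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, delivered_to INTEGER);
    CREATE TABLE order_items (order_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 100), (2, 100), (3, 100), (4, 200), (5, 300);
    INSERT INTO order_items VALUES
        (1, 'soda'), (2, 'soda'), (3, 'soda'), (4, 'blt'), (5, 'blt');
    """
)

# Distinct orders divided by distinct customers, per item.
rates = conn.execute(
    """
    SELECT name,
           ROUND(1.0 * COUNT(DISTINCT order_id) / COUNT(DISTINCT delivered_to), 2)
               AS reorder_rate
      FROM order_items
      JOIN orders ON orders.id = order_items.order_id
     GROUP BY 1
     ORDER BY 2 DESC
    """
).fetchall()
print(rates)
```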

 

 
2. Common Metrics
 
2.2 Daily Revenue
 
SELECT 
    DATE(created_at), ROUND(SUM(price), 2)
FROM
    purchases
GROUP BY 1
ORDER BY 1;

 

2.3 Daily Revenue 2
 
Update our daily revenue query to exclude refunds.
SELECT 
    DATE(created_at), ROUND(SUM(price), 2) AS daily_rev
FROM
    purchases
WHERE
    refunded_at IS NULL
GROUP BY 1
ORDER BY 1;

 

Codecademy's instructions and answer checker both require "WHERE refunded_at IS NOT NULL" here, which is presumably a mistake.
 
2.4 Daily Active Users
 
Calculate DAU
SELECT 
    DATE(created_at), COUNT(DISTINCT user_id) AS DAU
FROM
    gameplays
GROUP BY 1
ORDER BY 1;

 

2.5 Daily Active Users 2
 
Calculate DAU per-platform
SELECT 
    DATE(created_at), platform, COUNT(DISTINCT user_id) AS DAU
FROM
    gameplays
GROUP BY 1 , 2
ORDER BY 1 , 2;

 

2.6 Daily Average Revenue Per Paying User(ARPPU)
 
Calculate Daily ARPPU
SELECT 
    DATE(created_at),
    ROUND(SUM(price) / COUNT(DISTINCT user_id), 2) AS ARPPU
FROM
    purchases
WHERE
    refunded_at IS NULL
GROUP BY 1
ORDER BY 1;

 

2.8 ARPU 2
 
One way to create and organize temporary results in a query is with CTEs, Common Table Expressions, aka WITH ... AS clauses. The WITH ... AS clauses make it easy to define and use results in a more organized way than subqueries.
 
NB
MySQL does not support CTEs before version 8.0. A temporary table might work as a substitute; this has not been verified. (?)
 
Calculate Daily Revenue
WITH daily_revenue AS 
    (
    SELECT 
        date(created_at) AS dt,
        ROUND(SUM(price), 2) AS rev
    FROM
        purchases
    WHERE refunded_at IS NULL
    GROUP BY 1
    )
SELECT 
    * 
FROM 
    daily_revenue 
ORDER BY dt;

 

2.9 ARPU 3
 
Calculate Daily ARPU
WITH daily_revenue AS (
  SELECT
    DATE(created_at) AS dt,
    ROUND(SUM(price), 2) AS rev
  FROM purchases
  WHERE refunded_at IS NULL
  GROUP BY 1
), 
daily_players AS (
  SELECT
    DATE(created_at) AS dt,
    COUNT(DISTINCT user_id) AS players
  FROM gameplays
  GROUP BY 1
)
SELECT
  daily_revenue.dt,
  daily_revenue.rev / daily_players.players
FROM daily_revenue
  JOIN daily_players USING (dt);

 

2.12 1 Day Retention 2
 
SELF JOIN:
By using a self-join, we can make multiple gameplays available on the same row of results. This will enable us to calculate retention.
The power of self-join comes from joining every row to every other row. This makes it possible to compare values from two different rows in the new result set.
 
SELECT 
    DATE(g1.created_at) AS dt, g1.user_id
FROM
    gameplays AS g1
        JOIN
    gameplays AS g2 ON g1.user_id = g2.user_id
ORDER BY 1
LIMIT 100;

 

2.13 1 Day Retention 3
 
Calculate 1 Day Retention Rate
SELECT 
    DATE(g1.created_at) AS dt,
    ROUND(100 * COUNT(DISTINCT g2.user_id) / COUNT(DISTINCT g1.user_id)) AS retention
FROM
    gameplays AS g1
        LEFT JOIN
    gameplays AS g2 ON g1.user_id = g2.user_id
        AND DATE(g1.created_at) = DATE(DATE_SUB(g2.created_at, INTERVAL 1 DAY))
GROUP BY 1
ORDER BY 1;

 

NB
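In the games industry, the accepted definition of 1 Day Retention Rate is the share of DNU (daily new users) who log in again the next day. This exercise instead computes the share of DAU (daily active users) who log in again the next day.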
 
 
 
IV. Appendix
 
 
1. A brief look at how DISTINCT and GROUP BY deduplicate
 
SELECT 
    COUNT(DISTINCT amount_paid)
FROM
    order_items;

SELECT 
    COUNT(1)
FROM
    (SELECT 
        1
    FROM
        order_items
    GROUP BY amount_paid) a;

SELECT 
    SUM(1)
FROM
    (SELECT 
        1
    FROM
        order_items
    GROUP BY amount_paid) a;

 

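The two approaches trade off computation against storage:

DISTINCT has to keep all the distinct values of the column in memory, so memory consumption can be high;
GROUP BY sorts the column first. The basic theory of sorting gives O(n log n) time and O(1) space, so the advantage is low space usage (?) and the drawback is a longer run time.
 
Choose according to the situation: use GROUP BY when the values are widely spread (many distinct values), and DISTINCT when the values are concentrated (few distinct values), where it is efficient and takes little space.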
 
 
2. SELECT 1 FROM table
 
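1 is a constant (any value would do), and every returned row has this value. In terms of efficiency, 1 > column name > *, because the data dictionary does not have to be consulted (?)
It has no special meaning: a 1 is returned for each matching row, and an empty result is returned when there are none.
It is commonly used in EXISTS and other subqueries, typically to test whether the subquery finds any row that satisfies the condition, for example: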
SELECT 
    *
FROM
    orders
WHERE
    EXISTS( SELECT 
            1
        FROM
            orders o
                JOIN
            order_items i ON o.id = i.order_id);

 

 
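3. What WHERE 1=1 and WHERE 1=0 are for
 
3.1 WHERE 1=1: the condition is always true (likewise WHERE 'a' = 'a', etc.). When building dynamic SQL, for example with a variable number of filter conditions, 1=1 makes it easy to keep the statement well-formed:
 
3.1.1 Suppose a search page offers several optional filters and the user chooses which ones to fill in. Building the query dynamically in the usual way gives code roughly like this:
MySqlStr = "SELECT * FROM table WHERE";

  if (Age.Text.Length > 0)
  {
    MySqlStr = MySqlStr + " Age='" + Age.Text + "'";
  }

  if (Address.Text.Length > 0)
  {
    MySqlStr = MySqlStr + " AND Address='" + Address.Text + "'";
  }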

 

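1) If both if conditions above are true, i.e. the user filled in both filters, the final statement becomes:
MySqlStr="SELECT * FROM table WHERE Age='27' AND Address='广东省深圳市南山区科兴科学园'", which is complete and executes correctly;
2) If neither condition holds, MySqlStr="SELECT * FROM table WHERE";, which is invalid and cannot be executed.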
 
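3.1.2 Using WHERE 1=1:
 
1) MySqlStr="SELECT * FROM table WHERE 1=1 AND Age='27' AND Address='广东省深圳市南山区科兴科学园'";, which is valid and executes correctly;
2) MySqlStr="SELECT * FROM table WHERE 1=1";, which also executes correctly because WHERE 1=1 is always true; it is equivalent to MySqlStr="SELECT * FROM table";
 
In other words: if the user selects no fields and enters no keywords on the multi-condition search page, all rows of the table are returned; if the user selects some fields and enters some keywords, the query runs with exactly those conditions.
 
WHERE 1=1 is nothing more than a way of producing a valid, runnable dynamic SQL statement when the set of conditions on a multi-condition search page is not known in advance.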
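 
The pattern can be sketched as follows. This is a minimal illustration, not the original page's code: the people table, its columns and the build_query helper are hypothetical, SQLite (via Python's sqlite3 module) stands in for MySQL, and the values are passed as bound parameters rather than concatenated into the string, which is a deliberate departure from the string-concatenation style shown above.

```python
import sqlite3


def build_query(age=None, address=None):
    """Return (sql, params) for a search with optional filters (hypothetical helper)."""
    sql = "SELECT name FROM people WHERE 1=1"  # always-true anchor condition
    params = []
    if age is not None:
        sql += " AND age = ?"  # every filter can start with AND
        params.append(age)
    if address:
        sql += " AND address = ?"
        params.append(address)
    return sql, params


# In-memory SQLite stand-in for MySQL; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, address TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 27, "Shenzhen"), ("Bob", 27, "Beijing"), ("Cy", 35, "Shenzhen")],
)


def search(**filters):
    """Run the dynamically built query and return the matching names."""
    sql, params = build_query(**filters)
    return sorted(row[0] for row in conn.execute(sql, params))


print(search())                             # no filters: every row
print(search(age=27))                       # one filter
print(search(age=27, address="Shenzhen"))   # both filters
```

Because each optional condition is appended with a leading AND, the statement stays valid whichever subset of filters is supplied.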
 
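3.2 WHERE 1=0: the condition is always false (likewise WHERE 1 <> 1, etc.). No rows are returned, only the table structure, which makes it useful for creating tables quickly:
 
3.2.1 Read a table's structure without its data. This saves memory because no result set has to be stored: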
SELECT 
    *
FROM
    table
WHERE
    1 = 0; 

 

3.2.2 Create a new table with the same structure as the queried table:
CREATE TABLE newtable AS SELECT * FROM
    oldtable
WHERE
    1 = 0;  

 

http://www.cnblogs.com/junyuz/archive/2011/03/10/1979646.html
 
 
4. Copying a table
 
CREATE TABLE newtable AS SELECT * FROM
    oldtable;  

 

5. Having
 
NB
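The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions. Common aggregate functions include COUNT, SUM, AVG, MAX and MIN.
 
Example 1: find the customers whose total order amount is less than 2000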
 
SELECT 
    Customer, SUM(OrderPrice)
FROM
    Orders
GROUP BY Customer
HAVING SUM(OrderPrice) < 2000;

 

Example 2: find whether the customers "Bush" or "Adams" have a total order amount above 1500
 
SELECT 
    Customer, SUM(OrderPrice)
FROM
    Orders
WHERE
    Customer = 'Bush' OR Customer = 'Adams'
GROUP BY Customer
HAVING SUM(OrderPrice) > 1500;

 

http://www.w3school.com.cn/sql/sql_having.asp
 
 
6. USING
 
It is mostly syntactic sugar, but a couple differences are noteworthy:
 
ON is the more general of the two. One can JOIN tables ON a column, a set of columns and even a condition. For example:
SELECT 
    *
FROM
    world.City
        JOIN
    world.Country ON (City.CountryCode = Country.Code) 
WHERE ...

 

USING is useful when both tables share a column of the exact same name on which they join. In this case, one may say:
SELECT 
    film.title, film_id # film_id is not prefixed
FROM
    film
        JOIN
    film_actor USING (film_id)
WHERE ...

 

To do the above with ON, we would have to write:
SELECT 
    film.title, film.film_id # film.film_id is required here
FROM
    film
        JOIN
    film_actor ON (film.film_id = film_actor.film_id)
WHERE ...

 

Note the film.film_id qualification in the SELECT clause. It would be invalid to just say film_id, since that would be ambiguous.
 
http://stackoverflow.com/questions/11366006/mysql-on-vs-using
 
 
7. IFNULL() and COALESCE() - specify how NULL values are handled
 
Suppose the "UnitsOnOrder" column is optional and contains NULL values. Using:
SELECT
    ProductName, UnitPrice * (UnitsInStock + UnitsOnOrder)
FROM
    Products;

 

For any row where "UnitsOnOrder" is NULL, the result is NULL.
To make the calculation work, we want a NULL value to be treated as 0:
SELECT
    ProductName, UnitPrice * (UnitsInStock + IFNULL(UnitsOnOrder, 0))
FROM
    Products;

SELECT
    ProductName, UnitPrice * (UnitsInStock + COALESCE(UnitsOnOrder, 0))
FROM
    Products;
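 
The effect of NULL on arithmetic, and of COALESCE on the result, can be seen on two rows. The sketch assumes Python's sqlite3 module as a stand-in for MySQL (both provide COALESCE and IFNULL); the product rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; products are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (
        ProductName TEXT, UnitPrice REAL, UnitsInStock INTEGER, UnitsOnOrder INTEGER
    );
    INSERT INTO Products VALUES ('widget', 2.0, 10, 5), ('gadget', 3.0, 4, NULL);
    """
)

# Any arithmetic involving NULL yields NULL, so 'gadget' has no value here.
raw = conn.execute(
    "SELECT ProductName, UnitPrice * (UnitsInStock + UnitsOnOrder) "
    "FROM Products ORDER BY ProductName"
).fetchall()

# COALESCE substitutes 0 for the missing UnitsOnOrder before the addition.
fixed = conn.execute(
    "SELECT ProductName, UnitPrice * (UnitsInStock + COALESCE(UnitsOnOrder, 0)) "
    "FROM Products ORDER BY ProductName"
).fetchall()

print(raw)
print(fixed)
```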

 

 
8. UCASE() and LCASE() - convert field values to upper or lower case
 
SELECT 
    LCASE(first_name) AS first_name,
    last_name,
    city,
    UCASE(state) AS state
FROM
    bakeries;

 

9. MID() - extract characters from a text field
 
SELECT 
    MID(column_name, start, length)
FROM
    table_name;

 

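column_name: required. The field to extract characters from.
start: required. The starting position (the first position is 1).
length: optional. The number of characters to return. If omitted, MID() returns the rest of the text.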
 
SELECT 
    MID(state, 1, 3) AS smallstate
FROM
    bakeries;

 

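10. LENGTH() - return the length of the value in a text field (in MySQL this is measured in bytes; CHAR_LENGTH() counts characters)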
 
SELECT 
    LENGTH(City) AS LengthOfCity
FROM
    bakeries;

 

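11. WHERE EXISTS
 
11.1 WHERE EXISTS and WHERE NOT EXISTS
 
EXISTS: tests whether the result set of the subquery is empty. It returns True if the result set is not empty and False if it is. In other words, it returns True if at least one record satisfying the condition exists, and False otherwise.
 
NOT EXISTS: the opposite of EXISTS. It returns True when the subquery result is empty and False otherwise, i.e. "if no such row exists".
 
Queries that use EXISTS and NOT EXISTS are correlated subqueries: the subquery is executed once for each row of the outer (parent) query. The first tuple of the parent query is taken and the subquery is evaluated using the attributes that link it to the subquery; if the subquery's WHERE condition holds, the expression returns True and the tuple is added to the result set. This repeats until every tuple of the parent table has been visited.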
 
https://zhuanlan.zhihu.com/p/20005249
 
Names of the students who have taken at least one course:
SELECT 
    Sname
FROM
    student
WHERE
    EXISTS( SELECT 
            *
        FROM
            sc,
            course
        WHERE
            sc.Sno = student.Sno
                AND sc.Cno = course.Cno);

 

Names of the courses not taken by student 200215123:
SELECT 
    Cname
FROM
    course
WHERE
    NOT EXISTS( SELECT 
            *
        FROM
            sc
        WHERE
            Sno = 200215123
                AND Cno = course.Cno);

 

11.2 Double nesting of WHERE EXISTS and WHERE NOT EXISTS
 
1) Names of the students who have taken every course:
SELECT 
    Sname
FROM
    student
WHERE
    NOT EXISTS( SELECT 
            *
        FROM
            course
        WHERE
            NOT EXISTS( SELECT 
                    *
                FROM
                    sc
                WHERE
                    Sno = student.Sno AND Cno = course.Cno));
 
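Approach:
STEP 1: take the first tuple from the student table and read its Sno value.
STEP 2: take the first tuple from the course table and read its Cno value.
STEP 3: using these Sno and Cno values, scan all records in the sc table (the enrollment records). If no record is found in sc for a given Sno and Cno, the student with that Sno has not taken the course with that Cno.
STEP 4: for a given student, if after scanning all records in course (all courses) no course can be found that he or she has not taken, the student has taken every course.
STEP 5: add this student to the result set.
STEP 6: go back to STEP 1 and take the next tuple from student.
STEP 7: display the complete result set.
The first NOT EXISTS corresponds to STEP 4, and the second NOT EXISTS corresponds to STEP 3.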
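 
The double NOT EXISTS query can be checked against a tiny data set in which only one student has taken both courses. The sketch uses Python's sqlite3 module as a stand-in for MySQL; the student names, course names and integer keys are invented, and the inner column references are qualified with sc. for readability.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (Sno INTEGER PRIMARY KEY, Sname TEXT);
    CREATE TABLE course (Cno INTEGER PRIMARY KEY, Cname TEXT);
    CREATE TABLE sc (Sno INTEGER, Cno INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO course VALUES (1, 'Databases'), (2, 'Operating Systems');
    INSERT INTO sc VALUES (1, 1), (1, 2), (2, 1);  -- Cy has no enrollments
    """
)

# "There is no course for which there is no enrollment record of this student."
took_all_courses = [
    row[0]
    for row in conn.execute(
        """
        SELECT Sname
          FROM student
         WHERE NOT EXISTS (
               SELECT * FROM course
                WHERE NOT EXISTS (
                      SELECT * FROM sc
                       WHERE sc.Sno = student.Sno
                         AND sc.Cno = course.Cno))
        """
    )
]
print(took_all_courses)
```

Bob is missing course 2 and Cy is missing both, so the outer NOT EXISTS fails for them and only Ann remains.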
 
2) Names of the courses taken by every student:
SELECT 
    Cname
FROM
    course
WHERE
    NOT EXISTS( SELECT 
            *
        FROM
            student
        WHERE
            NOT EXISTS( SELECT 
                    *
                FROM
                    sc
                WHERE
                    Cno = course.Cno AND Sno = student.Sno));

 

3) Student numbers of the students who have taken every course that student 200215123 has taken:
SELECT 
    DISTINCT Sno
FROM
    sc scx
WHERE
    NOT EXISTS( SELECT 
            *
        FROM
            sc scy
        WHERE scy.Sno = 200215123
           AND NOT EXISTS( SELECT 
                    *
                FROM
                    sc
                WHERE
                    Sno = scx.Sno AND Cno = scy.Cno));

 

https://zhuanlan.zhihu.com/p/20005249
 
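Extension (the accompanying diagram was not preserved):

 

1. Condition 1: the student satisfies C2. Condition 2: the student has taken a course => EXISTS + EXISTS: students who have taken at least one course;
2. Condition 1: the student does not satisfy C2. Condition 2: the student has taken a course => NOT EXISTS + EXISTS: students who have not taken any course;
3. Condition 1: the student does not satisfy C2. Condition 2: there is a course the student has not taken => NOT EXISTS + NOT EXISTS: students who have taken all courses;
4. Condition 1: the student satisfies C2. Condition 2: there is a course the student has not taken => EXISTS + NOT EXISTS: students who have not taken all courses;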
 
11.3 Comparing WHERE EXISTS and WHERE IN
 
SELECT 
    Sname
FROM
    student
WHERE
    EXISTS( SELECT 
            *
        FROM
            sc,
            course
        WHERE
            sc.Sno = student.Sno
                AND sc.Cno = course.Cno
                AND course.Cname = '操作系统');

 

SELECT 
    Sname
FROM
    student
WHERE
    Sno IN (SELECT 
            Sno
        FROM
            sc,
            course
        WHERE
            sc.Sno = student.Sno
                AND sc.Cno = course.Cno
                AND course.Cname = '操作系统');

 

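Both queries above return the names of the students who have taken the course "操作系统" (Operating Systems), but they work differently:
EXISTS: loops over the outer table and queries the inner table on each iteration. The parent query is examined first, then the subquery runs until it finds the first match.
IN: hash-joins the inner and outer tables. The subquery is executed first and its result is stored in an indexed temporary table. The parent query is suspended until the subquery has finished and its result is in the temporary table, and only then is the parent query executed.
 
Therefore:
1) If one of the two tables is small and the other is large, use EXISTS when the subquery table is the large one and IN when the subquery table is the small one.
2) When the two tables are of similar size, the execution times of the three approaches are usually:
EXISTS <= IN <= JOIN
NOT EXISTS <= NOT IN <= LEFT JOIN
Only when the column allows NULL is NOT IN the slowest:
NOT EXISTS <= LEFT JOIN <= NOT IN
3) Whichever table is larger, NOT EXISTS is faster than NOT IN, because NOT IN scans both the inner and outer tables in full without using indexes, whereas the NOT EXISTS subquery can still use the indexes on the table.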
 
https://www.zhihu.com/question/51685434/answer/127169379
http://blog.csdn.net/ldl22847/article/details/7800572
 
 
 
 
 
 
 