本文并不表示R在数据分析方面比Python更好或更快速,我本人每天都使用两种语言。这篇文章只是提供了比较这两种语言的机会。
本文中的 数据 每天都会更新,我的文件版本更大,为4.63 GB。
CSV文件包含纽约市的311条投诉。它是纽约市开放数据门户网站中最受欢迎的数据集。
需要创建一个帐户以连接到plotly API。或者,可以只使用默认的ggplot2图形。
假设已安装sqlite3(因此可通过终端访问)。
将数据加载到内存中。
| user.self | sys.self | elapsed | user.child | sys.child | |
|---|---|---|---|---|---|
| time_data.table | 63.588 | 1.952 | 65.633 | 0 | 0 |
| time_data.table_full | 205.571 | 3.124 | 208.880 | 0 | 0 |
| time_readr | 277.720 | 5.018 | 283.029 | 0 | 0 |
我将使用data.table读取数据。该 fread 函数大大提高了读取速度。
默认情况下,dplyr查询只会从数据库中提取前10行。
数据处理的两个最佳选择(除了R之外)是:
预览数据
| Agency | CreatedDate | ClosedDate | ComplaintType | Descriptor | City |
|---|---|---|---|---|---|
| NYPD | 04/11/2015 02:13:04 AM | Noise - Street/Sidewalk | Loud Music/Party | BROOKLYN | |
| DFTA | 04/11/2015 02:12:05 AM | Senior Center Complaint | N/A | ELMHURST | |
| NYPD | 04/11/2015 02:11:46 AM | Noise - Commercial | Loud Music/Party | JAMAICA | |
| NYPD | 04/11/2015 02:11:02 AM | Noise - Street/Sidewalk | Loud Talking | BROOKLYN | |
| NYPD | 04/11/2015 02:10:45 AM | Noise - Street/Sidewalk | Loud Music/Party | NEW YORK |
选择几列
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Senior Center Complaint | N/A | DFTA |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Senior Center Complaint | N/A | DFTA |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| HPD Literature Request | The ABCs of Housing - Spanish | HPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Street Condition | Plate Condition - Noisy | DOT |
使用WHERE过滤行
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
使用WHERE和IN过滤列中的多个值
| ComplaintType | Descriptor | Agency |
|---|---|---|
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Commercial | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
| Noise - Street/Sidewalk | Loud Music/Party | NYPD |
| Noise - Street/Sidewalk | Loud Talking | NYPD |
在DISTINCT列中查找唯一值
使用COUNT(*)和GROUP BY查询值计数
| Agency | No.Complaints |
|---|---|
| 3-1-1 | 22499 |
| ACS | 3 |
| AJC | 7 |
| ART | 3 |
| CAU | 8 |
使用ORDER和-排序结果
数据库中有多少个城市?
让我们来绘制10个最受关注的城市
| City | No.Complaints |
|---|---|
| BROOKLYN | 2671085 |
| NEW YORK | 1692514 |
| BRONX | 1624292 |
| 766378 | |
| STATEN ISLAND | 437395 |
| JAMAICA | 147133 |
| FLUSHING | 117669 |
| ASTORIA | 90570 |
| Jamaica | 67083 |
| RIDGEWOOD | 66411 |
UPPER 转换CITY格式。| CITY | No.Complaints |
|---|---|
| BROOKLYN | 2671085 |
| NEW YORK | 1692514 |
| BRONX | 1624292 |
| 766378 | |
| STATEN ISLAND | 437395 |
| JAMAICA | 147133 |
| FLUSHING | 117669 |
| ASTORIA | 90570 |
| JAMAICA | 67083 |
| RIDGEWOOD | 66411 |
投诉类型(按城市)
提供的数据不适合SQLite的标准日期格式。
在SQL数据库中创建一个新列,然后使用格式化的date语句重新插入数据 创建一个新表并将格式化日期插入原始列名。
使用时间戳字符串过滤SQLite行:YYYY-MM-DD hh:mm:ss
| ComplaintType | CreatedDate | City |
|---|---|---|
| Noise - Street/Sidewalk | 2014-11-12 11:59:56 | BRONX |
| Taxi Complaint | 2014-11-12 11:59:40 | BROOKLYN |
| Noise - Commercial | 2014-11-12 11:58:53 | BROOKLYN |
| Noise - Commercial | 2014-11-12 11:58:26 | NEW YORK |
| Noise - Street/Sidewalk | 2014-11-12 11:58:14 | NEW YORK |
使用strftime从时间戳中拉出小时单位
| ComplaintType | CreatedDate | City | hour |
|---|---|---|---|
| Noise - Street/Sidewalk | 2015-11-04 02:13:04 | BROOKLYN | 02 |
| Senior Center Complaint | 2015-11-04 02:12:05 | ELMHURST | 02 |
| Noise - Commercial | 2015-11-04 02:11:46 | JAMAICA | 02 |
| Noise - Street/Sidewalk | 2015-11-04 02:11:02 | BROOKLYN | 02 |
| Noise - Street/Sidewalk | 2015-11-04 02:10:45 | NEW YORK | 02 |
首先,创建一个时间戳记四舍五入到前15分钟间隔的新列
| CreatedDate | interval |
|---|---|
| 2015-11-04 02:13:04 | 2015-11-04 02:00:00 |
| 2015-11-04 02:12:05 | 2015-11-04 02:00:00 |
| 2015-11-04 02:11:46 | 2015-11-04 02:00:00 |
| 2015-11-04 02:11:02 | 2015-11-04 02:00:00 |
| 2015-11-04 02:10:45 | 2015-11-04 02:00:00 |
| 2015-11-04 02:09:07 | 2015-11-04 02:00:00 |
| 2015-11-04 02:05:47 | 2015-11-04 02:00:00 |
| 2015-11-04 02:03:43 | 2015-11-04 02:00:00 |
| 2015-11-04 02:03:29 | 2015-11-04 02:00:00 |
| 2015-11-04 02:02:17 | 2015-11-04 02:00:00 |
绘制2003年的结果
大数据部落 -中国专业的第三方数据服务提供商,提供定制化的一站式数据挖掘和统计分析咨询服务
统计分析和数据挖掘咨询服务:y0.cn/teradat(咨询服务请联系官网客服)
【服务场景】
科研项目; 公司项目外包;线上线下一对一培训;数据爬虫采集;学术研究;报告撰写;市场调查。
【大数据部落】提供定制化的一站式数据挖掘和统计分析咨询



欢迎选修我们的R语言数据分析挖掘必知必会课程!
R语言大数据分析纽约市的311万条投诉统计可视化与时间序列分析
原文:https://www.cnblogs.com/tecdat/p/12092166.html