介绍
# Data Analyst Skill 📊
**将您的 AI 代理转变为数据分析引擎。**
查询数据库、分析电子表格、创建可视化,并生成驱动决策的洞察。
---
## 此技能的作用
✅ **SQL 查询** — 编写并执行针对数据库的查询 ✅ **电子表格分析** — 处理 CSV、Excel、Google Sheets 数据 ✅ **数据可视化** — 创建图表、图形和仪表板 ✅ **报告生成** — 包含洞察的自动化报告 ✅ **数据清洗** — 处理缺失数据、异常值、格式化 ✅ **统计分析** — 描述性统计、趋势、相关性
---
## 快速开始
1. 在 `TOOLS.md` 中配置您的数据源: ```markdown ### Data Sources - Primary DB: [Connection string or description] - Spreadsheets: [Google Sheets URL / local path] - Data warehouse: [BigQuery/Snowflake/etc.] ```
2. 设置您的工作区: ```bash ./scripts/data-init.sh ```
3. 开始分析!
---
## SQL 查询模式
### 常用查询模板
**基础数据探索** ```sql -- Row count SELECT COUNT(*) FROM table_name;
-- Sample data SELECT * FROM table_name LIMIT 10;
-- Column statistics SELECT column_name, COUNT(*) as count, COUNT(DISTINCT column_name) as unique_values, MIN(column_name) as min_val, MAX(column_name) as max_val FROM table_name GROUP BY column_name; ```
**基于时间的分析** ```sql -- Daily aggregation SELECT DATE(created_at) as date, COUNT(*) as daily_count, SUM(amount) as daily_total FROM transactions GROUP BY DATE(created_at) ORDER BY date DESC;
-- Month-over-month comparison SELECT DATE_TRUNC('month', created_at) as month, COUNT(*) as count, LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)) as prev_month, (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at))) / NULLIF(LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)), 0) * 100 as growth_pct FROM transactions GROUP BY DATE_TRUNC('month', created_at) ORDER BY month; ```
**同期群分析** ```sql -- User cohort by signup month SELECT DATE_TRUNC('month', u.created_at) as cohort_month, DATE_TRUNC('month', o.created_at) as activity_month, COUNT(DISTINCT u.id) as users FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY cohort_month, activity_month ORDER BY cohort_month, activity_month; ```
**漏斗分析** ```sql -- Conversion funnel WITH funnel AS ( SELECT COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) as views, COUNT(DISTINCT CASE WHEN event = 'signup' THEN user_id END) as signups, COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) as purchases FROM events WHERE date >= CURRENT_DATE - INTERVAL '30 days' ) SELECT views, signups, ROUND(signups * 100.0 / NULLIF(views, 0), 2) as signup_rate, purchases, ROUND(purchases * 100.0 / NULLIF(signups, 0), 2) as purchase_rate FROM funnel; ```
---
## 数据清洗
### 常见数据质量问题
| 问题 | 检测 | 解决方案 | |-------|-----------|----------| | **缺失值** | `IS NULL` 或空字符串 | 填充、删除或标记 | | **重复项** | `GROUP BY` 配合 `HAVING COUNT(*) > 1` | 根据规则去重 | | **异常值** | Z-score > 3 或 IQR 方法 | 调查、封顶或排除 | | **格式不一致** | 抽样和模式匹配 | 通过转换标准化 | | **无效值** | 范围检查、参照完整性 | 验证并纠正 |
### 数据清洗 SQL 模式
```sql -- Find duplicates SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1;
-- Find nulls SELECT COUNT(*) as total, SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) as null_emails, SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) as null_names FROM users;
-- Standardize text UPDATE products SET category = LOWER(TRIM(category));
-- Remove outliers (IQR method) WITH stats AS ( SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as q1, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as q3 FROM data ) SELECT * FROM data, stats WHERE value BETWEEN q1 - 1.5*(q3-q1) AND q3 + 1.5*(q3-q1); ```
### 数据清洗清单
```markdown # Data Quality Audit: [Dataset]
## Row-Level Checks - [ ] Total row count: [X] - [ ] Duplicate rows: [X] - [ ] Rows with any null: [X]
## Column-Level Checks | Column | Type | Nulls | Unique | Min | Max | Issues | |--------|------|-------|--------|-----|-----|--------| | [col] | [type] | [n] | [n] | [v] | [v] | [notes] |
## Data Lineage - Source: [Where data came from] - Last updated: [Date] - Known issues: [List]
## Cleaning Actions Taken 1. [Action and reason] 2. [Action and reason] ```
---
## 电子表格分析
### 使用 Python 处理 CSV/Excel
```python import pandas as pd
# Load data df = pd.read_csv('data.csv') # or pd.read_excel('data.xlsx')
# Basic exploration print(df.shape) # (rows, columns) print(df.info()) # Column types and nulls print(df.describe()) # Numeric statistics
# Data cleaning df = df.drop_duplicates() df['date'] = pd.to_datetime(df['date']) df['amount'] = df['amount'].fillna(0)
# Analysis summary = df.groupby('category').agg({ 'amount': ['sum', 'mean', 'count'], 'quantity': 'sum' }).round(2)
# Export summary.to_csv('analysis_output.csv') ```
### 常用 Pandas 操作
```python # Filtering filtered = df[df['status'] == 'active'] filtered = df[df['amount'] > 1000] filtered = df[df['date'].between('2024-01-01', '2024-12-31')]
# Aggregation by_category = df.groupby('category')['amount'].sum() pivot = df.pivot_table(values='amount', index='month', columns='category', aggfunc='sum')
# Window functions df['running_total'] = df['amount'].cumsum() df['pct_change'] = df['amount'].pct_change() df['rolling_avg'] = df['amount'].rolling(window=7).mean()
# Merging merged = pd.merge(df1, df2, on='id', how='left') ```
---
## 数据可视化
### 图表选择指南
| 数据类型 | 最佳图表 | 使用场景 | |-----------|------------|----------| | 随时间变化的趋势 | 折线图 | 显示随时间变化的模式/变化 | | 类别比较 | 柱状图 | 比较离散类别 | | 占比 | 饼图/环形图 | 显示占比(≤5 个类别) | | 分布 | 直方图 | 了解数据分布情况 | | 相关性 | 散点图 | 两个变量之间的关系 | | 多个类别 | 水平条形图 | 排名或比较多个项目 | | 地理位置 | 地图 | 基于位置的数据 |
### 使用 Matplotlib/Seaborn 进行 Python 可视化
```python import matplotlib.pyplot as plt import seaborn as sns
# Set style plt.style.use('seaborn-v0_8-whitegrid') sns.set_palette("husl")
# Line chart (trends) plt.figure(figsize=(10, 6)) plt.plot(df['date'], df['value'], marker='o') plt.title('Trend Over Time') plt.xlabel('Date') plt.ylabel('Value') plt.xticks(rotation=45) plt.tight_layout() plt.savefig('trend.png', dpi=150)
# Bar chart (comparisons) plt.figure(figsize=(10, 6)) sns.barplot(data=df, x='category', y='amount') plt.title('Amount by Category') plt.xticks(rotation=45) plt.tight_layout() plt.savefig('comparison.png', dpi=150)
# Heatmap (correlations) plt.figure(figsize=(10, 8)) sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0) plt.title('Correlation Matrix') plt.tight_layout() plt.savefig('correlation.png', dpi=150) ```
### ASCII 图表(快速终端可视化)
当您无法生成图像时,使用 ASCII:
``` Revenue by Month (in $K) ======================== Jan: ████████████████ 160 Feb: ██████████████████ 180 Mar: ████████████████████████ 240 Apr: ██████████████████████ 220 May: ██████████████████████████ 260 Jun: ████████████████████████████ 280 ```
---
## 报告生成
### 标准报告模板
```markdown # [Report Name] **Period:** [Date range] **Generated:** [Date] **Author:** [Agent/Human]
## Executive Summary [2-3 sentences with key findings]
## Key Metrics
| Metric | Current | Previous | Change | |--------|---------|----------|--------| | [Metric] | [Value] | [Value] | [+/-X%] |
## Detailed Analysis
### [Section 1] [Analysis with supporting data]
### [Section 2] [Analysis with supporting data]
## Visualizations [Insert charts]
## Insights 1. **[Insight]**: [Supporting evidence] 2. **[Insight]**: [Supporting evidence]
## Recommendations 1. [Actionable recommendation] 2. [Actionable recommendation]
## Methodology - Data source: [Source] - Date range: [Range] - Filters applied: [Filters] - Known limitations: [Limitations]
## Appendix [Supporting data tables] ```
### 自动化报告脚本
```bash #!/bin/bash # generate-report.sh
# Pull latest data python scripts/extract_data.py --output data/latest.csv
# Run analysis python scripts/analyze.py --input data/latest.csv --output reports/
# Generate report python scripts/format_report.py --template weekly --output reports/weekly-$(date +%Y-%m-%d).md
echo "Report generated: reports/weekly-$(date +%Y-%m-%d).md" ```
---
## 统计分析
### 描述性统计
| 统计量 | 说明 | 用例 | |-----------|-------------------|----------| | **平均值** | 平均值 | 集中趋势 | | **中位数** | 中间值 | 对异常值稳健 | | **众数** | 最常见值 | 分类数据 | | **标准差** | 围绕均值的离散程度 | 变异性 | | **最小值/最大值** | 范围 | 数据边界 | | **百分位数** | 分布形状 | 基准比较 |
### 使用 Python 进行快速统计
```python # Full descriptive statistics stats = df['amount'].describe() print(stats)
# Additional stats print(f"Median: {df['amount'].median()}") print(f"Mode: {df['amount'].mode()[0]}") print(f"Skewness: {df['amount'].skew()}") print(f"Kurtosis: {df['amount'].kurtosis()}")
# Correlation correlation = df['sales'].corr(df['marketing_spend']) print(f"Correlation: {correlation:.3f}") ```
### 统计检验快速参考
| 检验 | 用例 | Python | |------|----------|--------| | T 检验 | 比较两个平均值 | `scipy.stats.ttest_ind(a, b)` | | 卡方检验 | 分类变量独立性 | `scipy.stats.chi2_contingency(table)` | | 方差分析 (ANOVA) | 比较 3 个以上平均值 | `scipy.stats.f_oneway(a, b, c)` | | 皮尔逊 | 线性相关 | `scipy.stats.pearsonr(x, y)` |
---
## 分析工作流
### 标准分析流程
1. **定义问题** - 我们试图回答什么? - 这将指导什么决策?
2. **理解数据** - 有哪些可用数据? - 结构和质量如何?
3. **清洗和准备** - 处理缺失值 - 修复数据类型 - 删除重复项
4. **探索** - 描述性统计 - 初始可视化 - 识别模式
5. **分析** - 深入研究发现 - 必要时进行统计检验 - 验证假设
6. **沟通** - 清晰的可视化 - 可操作的洞察 - 建议
### 分析请求模板
```markdown # Analysis Request
## Question [What are we trying to answer?]
## Context [Why does this matter? What decision will it inform?]
## Data Available - [Dataset 1]: [Description] - [Dataset 2]: [Description]
## Expected Output - [Deliverable 1] - [Deliverable 2]
## Timeline [When is this needed?]
## Notes [Any constraints or considerations] ```
---
## 脚本
### data-init.sh 初始化您的数据分析工作区。
### query.sh 快速 SQL 查询执行。
```bash # Run query from file ./scripts/query.sh --file queries/daily-report.sql
# Run inline query ./scripts/query.sh "SELECT COUNT(*) FROM users"
# Save output to file ./scripts/query.sh --file queries/export.sql --output data/export.csv ```
### analyze.py Python 分析工具包。
```bash # Basic analysis python scripts/analyze.py --input data/sales.csv
# With specific analysis type python scripts/analyze.py --input data/sales.csv --type cohort
# Generate report python scripts/analyze.py --input data/sales.csv --report weekly ```
---
## 集成技巧
### 与其他技能集成
| 技能 | 集成方式 | |-------|-------------| | **营销** | 分析活动效果、内容指标 | | **销售** | 销售漏斗分析、转化分析 | | **业务拓展** | 市场研究数据、竞争对手分析 |
### 常见数据源
- **数据库:** PostgreSQL, MySQL, SQLite - **数据仓库:** BigQuery, Snowflake, Redshift - **电子表格:** Google Sheets, Excel, CSV - **API:** REST 端点, GraphQL - **文件:** JSON, Parquet, XML
---
## 最佳实践
1. **从问题开始** — 明确您要回答什么 2. **验证您的数据** — 垃圾进 = 垃圾出 (Garbage in, garbage out) 3. **记录一切** — 查询、假设、决策 4. **恰当的可视化** — 为正确的数据选择正确的图表 5. **展示您的工作** — 方法论很重要 6. **以洞察为先导** — 不仅仅是数据堆砌 7. **使其可执行** — “那又怎样?” → “现在做什么?” 8. **对查询进行版本控制** — 跟踪随时间的变化
---
## 常见错误
❌ **确认偏误** — 寻找数据来支持结论 ❌ **相关性 ≠ 因果性** — 对结论要谨慎 ❌ **摘樱桃** — 仅使用有利的数据 ❌ **忽略异常值** — 在删除之前先调查 ❌ **过度复杂化** — 简单的分析往往胜出 ❌ **缺乏背景** — 没有比较的数字是没有意义的
---
## 许可证
**许可证:** MIT — 自由使用、修改、分发。
---
*“目标是将数据转化为信息,将信息转化为洞察。” — 卡莉·菲奥里纳*