Introduction
Bitcoin's growing popularity comes with inherent risks of misuse. Graph-based analysis is a promising way to track cryptocurrency transactions, but the lack of comprehensive datasets has been a significant barrier. This paper introduces the Bitcoin Address Behavior Dataset (BABD), together with a framework that constructs heterogeneous Bitcoin transaction graphs and extracts analytical features from them. The dataset comprises:
- 13 address types
- 544,462 labeled data points
- 148 features across 5 indicator categories
- K-hop subgraphs for structural context
Key Challenges in Existing Research
Current Bitcoin transaction analysis methods face three critical limitations:
- Incomplete Address Typing: Most studies analyze ≤7 address types, insufficient for deep behavioral understanding.
- Unsystematic Metrics: Existing indicators lack categorization and omit crucial graph-derived features.
- Limited Reproducibility: Few studies disclose how transaction graphs are constructed.
Dataset Construction Methodology
1. Heterogeneous Graph Structure
Unlike simplified Bitcoin graphs that lose information, BABD uses a directed heterogeneous multigraph that preserves:
- Address (Ads) node characteristics
- Transaction (Tx) node attributes
This structure minimizes network information loss during pattern analysis.
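As a rough illustration of what a directed heterogeneous multigraph looks like in practice, the sketch below keeps address and transaction nodes as separate types and allows parallel edges. The class name `HeteroMultiGraph`, the attribute names, and the rule that edges only link an address to a transaction are assumptions made for this example, not BABD's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroMultiGraph:
    """Directed multigraph with two node types: 'address' and 'tx' (hypothetical)."""
    node_type: dict = field(default_factory=dict)   # node id -> 'address' or 'tx'
    node_attrs: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: list = field(default_factory=list)       # (src, dst, value_btc); parallel edges allowed

    def add_node(self, node_id, kind, **attrs):
        self.node_type[node_id] = kind
        self.node_attrs[node_id] = attrs

    def add_edge(self, src, dst, value_btc):
        # Assumption: an edge always links an address node and a tx node.
        if self.node_type[src] == self.node_type[dst]:
            raise ValueError("edges must link an address node and a tx node")
        self.edges.append((src, dst, value_btc))

    def out_edges(self, node_id):
        return [e for e in self.edges if e[0] == node_id]


def build_example():
    g = HeteroMultiGraph()
    g.add_node("addr_A", "address")
    g.add_node("addr_B", "address")
    g.add_node("tx_1", "tx", block_height=600000, fee_btc=0.0001)
    # addr_A funds tx_1 with two separate inputs; both edges are kept.
    g.add_edge("addr_A", "tx_1", 0.5)
    g.add_edge("addr_A", "tx_1", 0.3)
    g.add_edge("tx_1", "addr_B", 0.7999)
    return g
```

A simplified address-to-address graph would collapse the two inputs into a single edge and drop the transaction's own attributes; keeping `tx_1` as a node preserves both.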
2. Address Behavior Classification
The dataset categorizes 13 distinct Bitcoin wallet behaviors:
| Behavior Type | Examples |
|---|---|
| Criminal Activities | Ransomware, Darknet markets |
| Financial Services | Exchanges, P2P lending |
| Anonymization | Mixers, Laundering |
| Infrastructure | Mining pools, Personal wallets |
3. Data Collection Pipeline
- Network Crawling: API-based scraping of Bitcoin ledger data (100,001 blocks)
- Label Verification: Manual validation of address tags
- Data Categorization: Separation into Strong Addresses (SA) and Weak Addresses (WA)
Feature Extraction Framework
Statistical Indicators (SI)
| Category | Features | Description |
|---|---|---|
| PAI | Token count | Pure amount metrics |
| PDI | Purity ratios | Address correlation |
| PTI | Timestamps | Temporal patterns |
| CI | Combined features | Hybrid indicators |
Local Structure Indicators (LSI)
The 4-hop subgraph algorithm captures network topology in two steps:
- Converting transaction graphs to undirected networks
- Extracting structural features:
  - Degree correlation
  - Betweenness centrality
  - PageRank values
  - Network density
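The two steps above can be sketched in plain Python: build an undirected adjacency map, collect every node within k hops of a target address by breadth-first search, then compute a structural feature on the induced subgraph. The function names and the toy edge list are hypothetical, and only density is computed here; betweenness centrality and PageRank would more likely come from a graph library in practice.

```python
from collections import deque


def to_undirected(edges):
    """Build an undirected adjacency map from directed (src, dst) pairs."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def k_hop_nodes(adj, start, k):
    """Return all nodes within k hops of start (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the k-hop frontier
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def density(adj, nodes):
    """Undirected density: edges present / edges possible among nodes."""
    n = len(nodes)
    if n < 2:
        return 0.0
    # Each edge inside the subgraph is counted once from each endpoint.
    m = sum(len(adj[u] & nodes) for u in nodes) / 2
    return 2 * m / (n * (n - 1))


# Toy chain: a -> t1 -> b -> t2 -> c -> t3 -> d
EDGES = [("a", "t1"), ("t1", "b"), ("b", "t2"),
         ("t2", "c"), ("c", "t3"), ("t3", "d")]
```

With `adj = to_undirected(EDGES)`, calling `k_hop_nodes(adj, "a", 4)` stops at `c`, so `t3` and `d` fall outside the subgraph. This is the sense in which the hop count bounds the cost of feature extraction.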
Experimental Results
Performance Metrics
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| XGBoost | 96.71% | 96.46% | 96.71% | 96.57% |
| Random Forest | 95.62% | 95.21% | 95.62% | 95.38% |
| SVM | 93.24% | 92.80% | 93.24% | 92.97% |
With the SI and LSI features combined, the models performed consistently across all 13 address types, which suggests the framework is robust.
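In the table, recall equals accuracy for every model. That pattern is what support-weighted averaging over classes produces, since weighted recall reduces algebraically to overall accuracy. The source does not state which averaging scheme was used, so the sketch below is only an assumption about how such multi-class scores could be computed; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    total = len(y_true)
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total         # class share of the dataset
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(hits.values()) / total
    return accuracy, precision, recall, f1


# Hypothetical labels for four addresses.
Y_TRUE = ["exchange", "exchange", "mixer", "pool"]
Y_PRED = ["exchange", "mixer", "mixer", "pool"]
```

Calling `weighted_scores(Y_TRUE, Y_PRED)` returns accuracy and weighted recall that agree, while precision and F1 differ from them, mirroring the shape of the reported numbers.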
Key Takeaways
- Heterogeneous Graphs provide richer analytical context than simplified structures
- Subgraph Sampling enables scalable analysis of massive transaction networks
- Composite Features (SI+LSI) outperform single-indicator approaches
FAQ Section
Q1: How does BABD improve upon existing Bitcoin datasets?
A: BABD offers more complete address typing (13 vs. ≤7 types), systematic feature categorization, and reproducible graph construction methods.
Q2: Why use 4-hop subgraphs?
A: Testing showed that 4 hops gives the best balance between feature richness and computational cost: fewer hops lose context, while more hops become intractable.
Q3: What practical applications does this research enable?
A: The framework aids in detecting illicit activities (ransomware, mixing services), analyzing exchange behaviors, and improving wallet security analytics.
Q4: How were the 148 features selected?
A: Through iterative testing, starting with basic transaction metrics and then adding combined and graph-derived features that improved model performance.
Q5: Can this methodology apply to other cryptocurrencies?
A: Yes, with adjustments for chain-specific characteristics (e.g., Ethereum's smart contracts would require additional feature types).