Graph Construction
Raw Data Extraction
Since the inception of Bitcoin, all transactions have been recorded in a public ledger known as the Bitcoin blockchain15. This blockchain is maintained by a decentralized network of peers16. A new block of transactions is added roughly every ten minutes. After installing Bitcoin Core (https://bitcoin.org/en/bitcoin-core) version 24.0, we set up a Bitcoin node with the default configuration. This node enabled us to connect to the peer network and download the entire transaction ledger. The complete transaction history was stored in the local blockchain data directory, specifically within the ‘blkXXXXX.dat’ files located in the ‘blocks’ folder created by the node. We then parsed these files to extract all transaction details. In this study, we examined the transactions from the first 700,000 blocks of the blockchain, all of which precede the activation of the Taproot upgrade17. Taproot brought significant changes to Bitcoin’s transaction structure that would require adjustments to our data processing methods. Restricting the analysis to blocks predating this upgrade keeps the processing consistent and avoids these methodological issues.
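As an illustration of the raw layout of a ‘blkXXXXX.dat’ file, the following minimal sketch iterates over the serialized blocks it contains. It is not the parser used for the dataset. It assumes the mainnet layout in which each block is preceded by a 4-byte network magic and a 4-byte little-endian length, and the function name `iter_raw_blocks` is hypothetical.

```python
import hashlib
import struct

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # network magic written before each mainnet block


def iter_raw_blocks(stream):
    """Yield (offset, block_hash, prev_hash, raw_block) for each block in a blk file.

    `stream` is any binary file-like object supporting read() and tell().
    """
    while True:
        offset = stream.tell()
        preamble = stream.read(8)
        # Stop at end of file or when the magic is missing (e.g. zero padding).
        if len(preamble) < 8 or preamble[:4] != MAINNET_MAGIC:
            return
        (size,) = struct.unpack("<I", preamble[4:])
        raw = stream.read(size)
        if len(raw) < size:
            return  # truncated block, nothing more to read
        header = raw[:80]
        # Block hash: double SHA-256 of the 80-byte header, displayed byte-reversed.
        block_hash = hashlib.sha256(hashlib.sha256(header).digest()).digest()[::-1].hex()
        # Bytes 4..36 of the header hold the hash of the previous block.
        prev_hash = header[4:36][::-1].hex()
        yield offset, block_hash, prev_hash, raw
```

Because each header carries the hash of its predecessor, the chronological order of the blocks can be recovered by following `prev_hash` links, whatever the order in which the blocks were written to disk.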
Definition of Nodes
All circulating bitcoins are held in unspent TXOs, each protected by a locking script. Multiple TXOs can be secured by the same script, allowing them to be spent by the same address or group of addresses. A transaction can be framed as a transfer of value from one set of scripts to another. Therefore, scripts can be seen as the owners of the bitcoins they secure. As such, locking scripts emerge as potential candidates for the nodes in the graph dataset. All TXOs and scripts that have existed are inferred from transaction data. In our analysis, TXOs with zero value were omitted, resulting in the identification of over 874 million scripts.
A script is derived from a group of private keys held by one or more entities, which makes these entities the practical owners of the bitcoins secured by the script. Typically, a user may have multiple private keys for management, security, or privacy reasons14. Moreover, the derivation of a locking script from a group of private keys is not unique18. Consequently, a user generally owns or has owned bitcoins within TXOs safeguarded by various scripts. For a more comprehensive study of Bitcoin, it is preferable to assess value transfers between actual entities or users rather than between scripts, an approach adopted in several previous studies6,7,8. Hence, it is necessary to identify and cluster scripts that likely belong to a single entity, which will then serve as a node in the graph.
We employed heuristics from prior research19. These heuristics leverage established behavioral patterns and inclinations of Bitcoin users, along with recognized human biases, to link scripts that appear in the same transaction. Consequently, the nodes in our graph represent clusters of scripts, with roughly 252 million clusters identified. Each cluster is tagged with a unique integer alias. Thus, a TXO will be defined by a value v and the alias a signifying the cluster of its locking script. Given that locking scripts are also derived from private keys, we will refer to them as addresses from this point forward.
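A common way to maintain such clusters is a disjoint-set (union-find) structure. The sketch below is illustrative only: it uses the well-known multi-input heuristic (all input addresses of a transaction are merged) as a stand-in for the heuristics of ref. 19, which are not reproduced here, and the names `DisjointSet` and `cluster_addresses` are hypothetical.

```python
class DisjointSet:
    """Union-find over hashable ids, with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # An id seen for the first time starts as its own singleton cluster.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited id directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def cluster_addresses(transactions):
    """Merge the input addresses of each transaction; return {address: integer alias}.

    `transactions` is an iterable of lists of input addresses (assumed layout).
    """
    ds = DisjointSet()
    for input_addresses in transactions:
        input_addresses = list(input_addresses)
        for addr in input_addresses:
            ds.union(input_addresses[0], addr)
    alias_of_root = {}
    return {
        addr: alias_of_root.setdefault(ds.find(addr), len(alias_of_root))
        for addr in list(ds.parent)
    }
```

For example, `cluster_addresses([[1, 2], [2, 3], [4]])` places addresses 1, 2, and 3 in one cluster and address 4 in another, each cluster receiving its own integer alias.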
Edges
A transaction is a transformation of a set of input TXOs Δin into a new set of output TXOs Δout. Nothing prevents an alias from appearing in both the inputs and the outputs. This commonly occurs, for example, when receiving change from a payment, since all input TXOs are fully consumed regardless of the payment amount. Because an alias can appear in both the inputs and the outputs of a transaction, it is essential to determine whether this alias sends or receives value during the transaction. We define the net value received by an alias a using equation (1): the difference between the value it receives in the outputs and the value it spends in the inputs.
$$v_{\Delta}(a)=\sum_{(v',a')\in\Delta_{\mathrm{out}},\,a'=a}v'\;-\sum_{(v',a')\in\Delta_{\mathrm{in}},\,a'=a}v' \tag{1}$$
As a result, the entity identified as a is categorized as a recipient if its net value received is positive; otherwise, a is deemed a sender. The value transferred from sender a to recipient \(a'\) is the value received by \(a'\) multiplied by the share of the total input value provided by a. An edge is then drawn from each sender to each recipient in the transactions considered.
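The computation of equation (1) and of the resulting edges can be sketched as follows. This is a minimal illustration rather than the published code. It assumes that TXOs are given as (value, alias) pairs with integer values, and it interprets the share provided by a sender as its net value sent divided by the total net value sent by all senders; the actual implementation may define this share differently. The function names are hypothetical.

```python
from collections import defaultdict


def net_values(inputs, outputs):
    """Equation (1): value received in the outputs minus value spent in the inputs."""
    net = defaultdict(int)
    for value, alias in outputs:
        net[alias] += value
    for value, alias in inputs:
        net[alias] -= value
    return dict(net)


def transaction_edges(inputs, outputs):
    """Return {(sender, recipient): value} for a single transaction."""
    net = net_values(inputs, outputs)
    # Aliases with a zero net value contribute a zero share and are left out.
    senders = {a: -v for a, v in net.items() if v < 0}
    recipients = {a: v for a, v in net.items() if v > 0}
    total_sent = sum(senders.values())
    edges = {}
    for sender, sent in senders.items():
        for recipient, received in recipients.items():
            # Assumption: each recipient's value is split across senders
            # in proportion to their net contribution.
            edges[(sender, recipient)] = received * sent / total_sent
    return edges
```

With inputs `[(500, 1), (300, 2)]` and outputs `[(200, 1), (300, 3), (300, 4)]`, alias 1 receives 200 as change, so aliases 1 and 2 each send a net value of 300 and each of the four sender-recipient edges carries a value of 150.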
CoinJoin and Colored Coin Transactions
CoinJoin is a specific type of transaction that enhances privacy within Bitcoin20. The transaction ledger is public, and each transaction can be analyzed, allowing for relatively straightforward tracing of value flows. For any given user, it is easy to track their wealth, the sources from which they receive value, and the destinations to which they send it, which substantially undermines their privacy. CoinJoin facilitates the amalgamation of numerous individual transactions from different users into one large transaction. Each participant contributes input TXOs and defines output TXOs without disclosing which inputs correspond to which outputs. This process obscures the origin and destination of transactions, complicating the efforts of external observers to trace the transfer flow of a particular user. Additionally, it interferes with certain clustering heuristics by causing them to merge scripts that do not belong to the same user. Hence, we have chosen to exclude these transactions from (1) the construction of script clusters and (2) the addition of edges in the graphs. The creation of this type of transaction is facilitated by specialized software, with several implementations available, including Wasabi (https://docs.wasabiwallet.io) and Whirlpool/Samourai (https://github.com/Samourai-Wallet/Whirlpool). We applied heuristics from previous work to identify these transactions20. These heuristics were derived from an analysis of the open-source implementations of these software solutions to identify recognizable patterns associated with such transactions.
Colored coin transactions serve to transfer value in forms other than bitcoin, including alternative cryptocurrencies and tangible assets21. These transactions embed additional information, such as the asset type or the quantity being transferred, within the transaction’s locking script, making them relatively easy to identify. To detect these transactions, we crafted heuristics based on established protocols such as Open Assets (https://github.com/OpenAssets), Omni Layer (https://github.com/OmniLayer), and EPOBC (https://github.com/chromaway/ngcccbase/wiki/EPOBC_simple). We exclude these transactions from our graph construction so that the analysis covers only standard Bitcoin transactions.
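To give an idea of what such a heuristic can look like, the sketch below checks output scripts for an OP_RETURN payload that starts with a protocol marker. It is an assumption-laden illustration, not the heuristics used for the dataset: it covers only the Open Assets marker (the bytes `0x4f41` followed by version `0x0100`) and the ‘omni’ prefix of Omni Layer Class C payloads, and it does not handle EPOBC or older Omni encodings. All function names are hypothetical.

```python
OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C

# Payload prefixes assumed for this sketch.
MARKERS = {
    "open_assets": bytes.fromhex("4f410100"),  # "OA" marker + version 0x0100
    "omni": b"omni",                           # Omni Layer Class C prefix
}


def op_return_payload(script: bytes):
    """Return the first data push of an OP_RETURN script, or None."""
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    opcode = script[1]
    if 1 <= opcode <= 75:
        # Opcodes 1..75 push that many bytes directly.
        start, length = 2, opcode
    elif opcode == OP_PUSHDATA1 and len(script) >= 3:
        # OP_PUSHDATA1 is followed by a one-byte length.
        start, length = 3, script[2]
    else:
        return None
    payload = script[start:start + length]
    return payload if len(payload) == length else None


def colored_coin_protocol(output_scripts):
    """Return the name of the first matching protocol marker, or None."""
    for script in output_scripts:
        payload = op_return_payload(script)
        if payload is None:
            continue
        for name, marker in MARKERS.items():
            if payload.startswith(marker):
                return name
    return None
```

A transaction for which `colored_coin_protocol` returns a name would be flagged for exclusion in this sketch.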
Attributes
Attributes assigned to edges summarize the directed value transfers they represent. Attributes associated with nodes are primarily derived from the edges that involve those nodes, offering insights into their transactional behavior. The attributes are detailed in Tables 1 and 2. The blockchain is structured as a chain of blocks, each containing a sequence of transactions. When defining attributes, the block index of a transaction refers to the index, within the blockchain, of the block containing the transaction. Hence, the block index can be considered a timestamp.
Overview of the Dataset Construction
The graph dataset is composed of two tables: one contains the nodes and their attributes, while the other contains the edges and their attributes. The data construction process involved several steps and the creation of multiple intermediary tables. The construction process is illustrated in Fig. 2. All code is developed in Python and is publicly accessible to ensure reproducibility. We utilize PostgreSQL (https://www.postgresql.org) for data storage in a database and the Python package Psycopg2 (https://www.psycopg.org) for database querying.
Fig. 2: Dataset construction process. Simple rectangles represent PostgreSQL tables created during the process, while the darker section indicates the final dataset.
The code is organized as follows:
1. Block Indexing. The transaction blocks downloaded from the Bitcoin node are stored across several thousand binary files. These blocks are organized in the order received from peers rather than by their chronological position in the blockchain. To facilitate data reading, we create a table that records each block’s location along with assorted metadata.
2. Transaction Processing. The blocks and their transactions are read in chronological order. For each transaction, the created TXOs are stored within the ‘CreatedTXO’ table, and the spent TXOs are stored in the ‘SpentTXO’ table. This ensures efficient retrieval of transaction-related data. During this process, we identify CoinJoin and colored coin transactions for exclusion from future analyses. Simultaneously, each encountered locking script (address) is cataloged in a table.
3. Address Clustering. In this step, we create address clusters. Initially, each address constitutes its own cluster. As we process transactions and apply the clustering heuristics, the clusters merge until they yield the final clusters, which serve as nodes in our graph. We utilize a disjoint-set data structure for efficient cluster storage and merging. The final clusters are maintained in the ‘Alias’ table in the format (address, cluster).
4. Inter-Cluster Edges. By this stage, we have identified the node corresponding to each address present in the transaction data, enabling us to determine the owning node for each TXO. We use equation (1) to calculate value transfers within a transaction. As we process the transactions, we add directed edges representing value transfers between nodes to the ‘TransactionEdge’ table. If an edge already exists between two nodes, its attributes are updated accordingly. From the ‘TransactionEdge’ table, we also create the ‘UndirectedTransactionEdge’ table, containing undirected edges between nodes, useful for calculating node degrees.
5. Intra-Cluster Edges. By repeating the previous step at the level of addresses rather than clusters, we construct the ‘ClusterTransactionEdge’ table, which holds undirected edges for value transfers occurring within the same address cluster, facilitating the calculation of specific node features.
6. Node Attribute Computation. Node attributes are computed from the tables constructed in steps 3, 4, and 5, using straightforward counting operations alongside summation, minimum, and maximum calculations.
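The insert-or-update logic of step 4 can be sketched with an upsert statement. To keep the example self-contained, it uses Python’s built-in `sqlite3` module as a stand-in for PostgreSQL and Psycopg2; a PostgreSQL version would need small dialect changes (for instance `%s` placeholders and `LEAST`/`GREATEST` in place of the two-argument `MIN`/`MAX`). The column names and the choice of edge attributes are assumptions made for illustration; the actual attributes are those listed in Tables 1 and 2.

```python
import sqlite3

UPSERT_EDGE = """
    INSERT INTO TransactionEdge (src, dst, total_value, n_transfers, first_block, last_block)
    VALUES (?, ?, ?, 1, ?, ?)
    ON CONFLICT (src, dst) DO UPDATE SET
        total_value = total_value + excluded.total_value,
        n_transfers = n_transfers + 1,
        first_block = MIN(first_block, excluded.first_block),
        last_block  = MAX(last_block, excluded.last_block)
"""


def create_edge_table(conn):
    """Create a directed-edge table keyed by (sender alias, recipient alias)."""
    conn.execute("""
        CREATE TABLE TransactionEdge (
            src INTEGER NOT NULL,
            dst INTEGER NOT NULL,
            total_value REAL NOT NULL,
            n_transfers INTEGER NOT NULL,
            first_block INTEGER NOT NULL,
            last_block INTEGER NOT NULL,
            PRIMARY KEY (src, dst)
        )
    """)


def add_transfer(conn, src, dst, value, block_index):
    """Insert a new edge, or update the attributes of an existing one."""
    conn.execute(UPSERT_EDGE, (src, dst, value, block_index, block_index))


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_edge_table(conn)
    add_transfer(conn, 1, 2, 150.0, 100)
    add_transfer(conn, 1, 2, 50.0, 90)   # same edge: attributes are updated
    print(conn.execute("SELECT * FROM TransactionEdge").fetchall())
```

Keying the table on the ordered pair (src, dst) keeps one row per directed edge, so that each additional transfer between the same two nodes only updates aggregate attributes.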
Node Labels
Diverse real-world entities with different motivations use Bitcoin, including individuals, government organizations, corporations, service providers, and criminal organizations. A large body of Bitcoin research studies the behaviors of these entities and the dynamics of value transfers among them. Such studies provide critical insights into the purposes and motivations driving Bitcoin usage. Bitcoin users are represented by pseudonymous addresses, and information from the blockchain alone is insufficient to determine the true identity or nature of the entity denoted by an alias. Neither the labeling process nor the final labels reveal information about individual humans, which alleviates privacy concerns.
Entities often scrutinized in prior research include individual users connected to illicit or criminal activities, such as Ponzi schemes, ransomware operators, or mixers6,9,22. Other examined entities encompass participants in Bitcoin’s economic activities, such as miners, exchanges, marketplaces, and faucets, along with those associated with entertainment, such as sports betting and gambling platforms7,8,23. With the proliferation of the crypto-economy beyond Bitcoin, we propose studying ‘bridges’ that enable value transfers among different crypto-economies. Consequently, we aim to focus our investigation on the following entity categories:
- Individual
- Mining: individuals or entities that validate and confirm transactions on the Bitcoin network.
- Exchange: online platforms facilitating the buying, selling, and trading of cryptocurrencies and fiat currencies.
- Marketplace: online platforms where users can purchase and sell goods or services using bitcoin as the primary payment method.
- Gambling: online platforms allowing users to play casino games, bet on sports, and take part in lotteries using bitcoin.
- Bet: address created by a gambling service specifically for receiving funds related to a unique bet.
- Faucet: promotional tools rewarding users with small amounts of bitcoin for completing tasks or viewing advertisements.
- Mixer: services that enhance the privacy and anonymity of transactions by obfuscating transaction trails on the blockchain.
- Ponzi: financial schemes promising high returns to investors, financed by funds from newer investors.
- Ransomware: malicious software that encrypts files on a victim’s computer, demanding a ransom for decryption.
- Bridge: protocols facilitating asset exchanges between Bitcoin and other blockchain networks (e.g., Ethereum).
These entities were chosen for their relevance and commonality within the cryptocurrency ecosystem, delivering an extensive overview of the various actors in the Bitcoin landscape.
Summary
To label nodes, we use BitcoinTalk (https://bitcointalk.org), a prominent online forum, as a source of Bitcoin-related data. Using a Python-based scraper built on Selenium (https://www.selenium.dev), we collected 14 million messages from 546,000 threads, concentrating on posts that mention Bitcoin addresses. These addresses were then linked to entities (e.g., services, companies) using ChatGPT (https://openai.com/chatgpt), a large language model. ChatGPT was instructed to identify deposit addresses, hot/cold wallets, and withdrawal transactions based on post content, transaction IDs, and USD amounts converted using the Bitstamp exchange rate. The complete labeling pipeline is illustrated in Fig. 3. This method enabled the labeling of 34,000 nodes and 100,000 Bitcoin addresses with entity types, such as ransomware operators or Ponzi schemes, by assigning the identified entities to predefined categories. The dataset has limitations, including potential inaccuracies from user-generated content, a bias towards English-speaking entities, and the difficulty of extracting precise information from unstructured text. Despite these constraints, ChatGPT extracted the relevant data from forum posts with high accuracy (83-96%). The result is a scalable, automated pipeline for constructing a large-scale, labeled Bitcoin transaction graph, supporting research into transaction patterns, entity behaviors, and malicious activities within the Bitcoin ecosystem. Combining forum data with AI-driven labeling offers a new way to address the scarcity of curated datasets in blockchain research.
BitcoinTalk
BitcoinTalk is an online forum dedicated to Bitcoin and remains one of the most active venues for discussion on the subject. The forum consists of various sections, subsections, and threads. A thread is a sequence of messages, or posts, that are expected to pertain to the thread’s topic or title. Bitcoin addresses are frequently mentioned in posts, and the context provided by the thread can help attribute these addresses to entities, such as services or companies. We developed a Python-based scraper using the Selenium (https://www.selenium.dev) package to systematically collect posts from the forum’s English-speaking section. This section typically features threads with numerous messages distributed over multiple pages. The scraper starts by accessing the first page of a thread to extract its posts, then navigates through the subsequent pages to retrieve all remaining posts. For every post, the scraper captures the text, a unique author identifier, and the date of publication. Data collection was completed at the end of 2023, yielding a total of 14,067,713 messages from 546,440 threads.
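Selecting the posts that mention Bitcoin addresses can be illustrated with the following sketch, which finds Base58Check candidates in a post and keeps those whose checksum is valid. It is an assumption about how such filtering could be done, not the authors’ code: it covers only legacy addresses starting with ‘1’ or ‘3’ and ignores Bech32 addresses, and the function names are hypothetical.

```python
import hashlib
import re

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
B58_INDEX = {ch: i for i, ch in enumerate(B58_ALPHABET)}

# Candidate legacy addresses: '1' or '3' followed by 25 to 34 Base58 characters.
CANDIDATE = re.compile(r"\b[13][1-9A-HJ-NP-Za-km-z]{25,34}\b")


def is_valid_base58check(address: str) -> bool:
    """Check the version byte and 4-byte checksum of a legacy address."""
    n = 0
    for ch in address:
        if ch not in B58_INDEX:
            return False
        n = n * 58 + B58_INDEX[ch]
    try:
        # 1 version byte + 20-byte hash + 4-byte checksum.
        raw = n.to_bytes(25, "big")
    except OverflowError:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    if payload[0] not in (0x00, 0x05):  # P2PKH or P2SH on mainnet
        return False
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def extract_addresses(post_text: str):
    """Return the valid legacy addresses mentioned in a post, in order of appearance."""
    return [m for m in CANDIDATE.findall(post_text) if is_valid_base58check(m)]
```

The checksum test discards most strings that merely resemble an address, such as addresses with a mistyped character.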
ChatGPT
Addresses were assigned to entity names using ChatGPT, an artificial intelligence assistant developed by OpenAI and based on the GPT foundation models24. ChatGPT is designed to engage in human-like automated dialogues with users. The GPT models are fine-tuned through supervised learning and reinforcement learning from human feedback. A conversation comprises a sequence of user prompts and assistant responses. ChatGPT has shown remarkable capabilities in various tasks, including following instructions and solving logical problems25. For this work, we used ChatGPT (model ‘gpt-4o-mini’) through API calls (https://platform.openai.com).
Deposit Addresses, Hot and Cold Wallets
We focused on addresses belonging to organizations, particularly those offering services in exchange for bitcoins. Transactions between user addresses and service addresses are fairly common across various platforms26. To access these services, users deposit funds by transferring bitcoins from personal addresses to those managed by the organization, known as deposit addresses27. Organizations usually generate unique deposit addresses for each client, simplifying the monitoring of client deposits while keeping control over these addresses. After users deposit funds from their personal addresses to the deposit addresses, they can utilize the services in exchange for the bitcoins deposited. Users have the option to withdraw any remaining bitcoins back to personal addresses once they finish using the service. Typically, funds from deposit addresses are consolidated into two types of addresses: hot wallets and cold wallets28. Hot wallets are online and contain sufficient funds for routine operations, such as user withdrawals. In contrast, cold wallets are typically offline to protect against online threats, storing the majority of the service’s funds and user deposits. Based on common interaction patterns and data gleaned from collected posts, instructions were formulated for ChatGPT to extract information about the addresses mentioned in those posts.
Prompts
We devised several prompts to guide ChatGPT in correlating addresses mentioned in a post with an entity name when contextually permissible. Along with the textual information in the post, we included supplementary data. Certain posts mention transaction IDs, which serve as unique identifiers for transactions on the blockchain. This allows readers to retrieve detailed transaction data from the blockchain, offering additional context. Since this detailed transaction information can also assist ChatGPT, we added transaction details (senders, recipients, amounts) related to the mentioned transaction IDs into the prompts. Although blockchain amounts are denoted in satoshis, users frequently refer to transaction amounts in USD in their posts. To aid ChatGPT in aligning Bitcoin amounts in transactions with USD amounts discussed in the posts, we included the converted USD amounts based on the BTC/USD exchange rate from the date of the post. This daily exchange rate was sourced from Bitstamp through their official API (https://www.bitstamp.net/api).
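The conversion from satoshis to USD is simple arithmetic: one bitcoin equals 100,000,000 satoshis, so an amount in satoshis is divided by 10^8 and multiplied by the BTC/USD rate of the day. A minimal sketch follows; the function name is hypothetical and the rate in the usage example is made up for illustration, not taken from Bitstamp.

```python
from decimal import Decimal, ROUND_HALF_UP

SATOSHIS_PER_BTC = Decimal(100_000_000)


def satoshis_to_usd(satoshis: int, btc_usd_rate) -> Decimal:
    """Convert an on-chain amount in satoshis to USD, rounded to cents."""
    usd = Decimal(satoshis) / SATOSHIS_PER_BTC * Decimal(str(btc_usd_rate))
    return usd.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # 0.25 BTC at a hypothetical rate of 40,000 USD per BTC.
    print(satoshis_to_usd(25_000_000, "40000"))
```

Using `Decimal` rather than floating-point numbers avoids rounding surprises when the converted amount is compared with a figure quoted in a post.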
All prompts are accessible in the code directory (see ‘Code Availability’). The first prompt is designed to detect Bitcoin deposits from customers to a service. Posters often mention deposit addresses or transaction IDs when experiencing issues with their deposits, such as their accounts not reflecting the correct Bitcoin amount. This prompt attributes the mentioned deposit addresses to entity names when the context allows. The second prompt targets user withdrawals. Users wishing to withdraw funds provide a recipient address to the service, which then creates the transaction and communicates the transaction ID once it is confirmed on the blockchain. Withdrawal transactions are typically funded by the service itself, likely from hot wallets, leading us to categorize the sending addresses of such transactions as owned by the identified service. The third prompt is similar: it also targets withdrawals, but for posts that do not give a transaction ID, and attempts to identify the involved entity, the withdrawal address, and the withdrawn amount. We reduce this situation to the previous case by searching the blockchain, around the post’s date (+/- three days), for a transaction in which the withdrawal address receives the specified amount. If a unique transaction meets these criteria, we assume it to be the withdrawal transaction and attribute its funding addresses to the specified entity. Lastly, the final prompt identifies hot and cold wallet addresses in various contexts.
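The search for a withdrawal transaction around the post’s date can be sketched as follows. The transaction layout (a dict with ‘date’, ‘inputs’, and ‘outputs’ keys) and the exact-amount comparison in satoshis are assumptions made for this illustration, and `find_withdrawal` is a hypothetical name.

```python
from datetime import date, timedelta


def find_withdrawal(transactions, address, amount, post_date, window_days=3):
    """Return the unique transaction paying `amount` to `address` within
    `window_days` of `post_date`, or None if there is no match or several."""
    earliest = post_date - timedelta(days=window_days)
    latest = post_date + timedelta(days=window_days)
    matches = [
        tx for tx in transactions
        if earliest <= tx["date"] <= latest
        and any(a == address and v == amount for a, v in tx["outputs"])
    ]
    # Ambiguous or missing matches are discarded rather than guessed.
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    txs = [
        {"date": date(2020, 5, 2), "inputs": ["hot_wallet"], "outputs": [("user", 50_000)]},
        {"date": date(2020, 6, 1), "inputs": ["other"], "outputs": [("user", 50_000)]},
    ]
    tx = find_withdrawal(txs, "user", 50_000, date(2020, 5, 1))
    print(tx["inputs"])  # funding addresses attributed to the service
```

Requiring a unique match trades coverage for reliability: a post is only used when the blockchain leaves no doubt about which transaction it describes.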
Labeling
The prompts were constructed so that the assistant returns a concise line of reasoning alongside an entity name and the associated Bitcoin addresses or transaction IDs. We established a mapping between the returned entity names and the predefined entity categories. This mapping relied on BitcoinTalk threads that mention the entity names and on the Internet Wayback Machine (https://web.archive.org). If an entity was unknown to us, often because the service had ceased operations or lacked prominence, we manually browsed the BitcoinTalk threads referencing the entity to deduce the type of service offered. When posts contained URLs, we attempted to access the corresponding websites for additional context. For websites that were no longer operational, we used the Internet Wayback Machine to retrieve historical snapshots. If the entity type could not be determined after this procedure, the corresponding address-entity pair was removed from the dataset to maintain label integrity and reliability. For each labeled address, we identified the locking scripts associated with the address and gave them the same label. Subsequently, for each labeled script, we assigned the same label to the corresponding cluster/node. If a cluster received conflicting labels, no label was assigned.
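The propagation of labels from addresses to clusters, including the rule for conflicting labels, can be sketched as follows. For brevity, the sketch collapses the address-to-script step and maps addresses directly to clusters; the function name and input layout are hypothetical.

```python
def label_clusters(address_labels, cluster_of):
    """Propagate address labels to clusters.

    address_labels: {address: label}
    cluster_of:     {address: cluster alias}
    Returns {cluster alias: label} for clusters whose labeled addresses all agree.
    """
    labels_per_cluster = {}
    for address, label in address_labels.items():
        cluster = cluster_of.get(address)
        if cluster is None:
            continue  # address absent from the transaction data
        labels_per_cluster.setdefault(cluster, set()).add(label)
    # A cluster with conflicting labels receives no label at all.
    return {
        cluster: next(iter(labels))
        for cluster, labels in labels_per_cluster.items()
        if len(labels) == 1
    }
```

For instance, a cluster containing one address labeled ‘exchange’ and another labeled ‘mixer’ is left unlabeled, whereas a cluster whose labeled addresses are all ‘exchange’ inherits that label.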
Limits
Posts obtained from BitcoinTalk may contain inaccuracies, misinformation, or intentional falsehoods put forth by users. Users might misrepresent transaction contexts or deliberately provide misleading information regarding the ownership or purpose of certain addresses. Such claims are difficult to verify because forum users are pseudonymous. Furthermore, the dataset may reflect biases due to its reliance on self-reported or community-shared information, potentially highlighting certain entities or behaviors while omitting others.
Moreover, considering that we only accessed posts from the English-speaking segment of the forum, the constructed dataset of labels may not truly reflect the entirety of entities worldwide. This language limitation might result in an underrepresentation of non-English-speaking users and entities, thus skewing node labels towards regions and communities that are more active in English-language discussions.
Additionally, despite advancements in large language models like ChatGPT, extracting information from brief, unstructured text remains a substantive challenge. Online discussions often lack clarity, consist of ambiguous references, or showcase inconsistent formatting, making precise retrieval difficult. To gauge the model’s capabilities in extracting desired information, we conducted an evaluation using samples from all available posts. Specifically, for each prompt, we selected 150 posts containing at least one Bitcoin address or transaction identifier. The evaluation metric was the ratio of posts from which all relevant details were accurately identified and extracted. For the deposit address extraction prompt, we targeted posts containing keywords such as ‘deposit’, ‘deposited’, ‘transfer’, and ‘transferred’ to maximize address relevance. Our analysis revealed ChatGPT successfully extracted complete information from 138 of the 150 posts (92%). For withdrawal address and transaction prompts, we selected posts mentioning ‘withdraw’, ‘withdrew’, ‘withdrawn’, and ‘withdrawal’. The outcomes exhibited high accuracy, yielding 96% for withdrawal transactions and 86% for withdrawal addresses. For the extraction of hot and cold wallet addresses, we targeted posts containing the terms ‘hot’ and ‘cold’, with ChatGPT correctly extracting details from 83% of these posts. These results indicate the effectiveness of ChatGPT in extracting information from forum posts.
Other Sources
To further enhance our dataset, we included additional data from various sources.
- We incorporated wallet addresses from different cryptocurrency exchanges as provided by the exchanges themselves. Several exchanges publicly disclose the addresses of their hot and cold wallets to promote transparency and demonstrate the custody of customer funds. These addresses were manually collected from the official websites of the respective exchanges. The URLs where these addresses were found are listed in Table 3. The addresses can either be obtained directly from the page or are included in a downloadable file accessible via the page.
Table 3: Sources of Cryptocurrency Exchange Wallet Addresses.
- We gathered 29 addresses from DefiLlama (https://defillama.com/cexs), particularly those associated with exchanges such as Coinsquare, Gate.io, Swissborg, Latoken, Woo, and Cakedefi. These addresses can be accessed by selecting the exchange on the DefiLlama CEX page and then clicking on the Wallet Addresses button, which redirects to a GitHub page containing the respective addresses.
- To broaden our collection of ransomware addresses, we incorporated addresses uncovered in previous research papers, including the two Padua ransomware datasets (https://spritz.math.unipd.it/projects/btcransomware/)6 and the Montréal ransomware dataset (https://doi.org/10.5281/zenodo.1238041)10,29. These addresses were collated from public sources such as security reports, academic publications, and online databases that catalog Bitcoin addresses linked to illicit activities.
- We also included addresses from the Specially Designated Nationals (SDN) List (https://sanctionslist.ofac.treas.gov/Home/SdnList) curated by the U.S. Department of the Treasury. The addresses are provided in the ‘SDN.XML’ file. Selected entities from this list include ‘Suex’, ‘Chatex’, and ‘Garantex’ (all designated as exchanges), ‘Hydra’ (marketplace), and ‘Blender.io’ (mixer).
- Additionally, we incorporated addresses holding bitcoins associated with the bridge Wrapped Bitcoin (https://wbtc.network/dashboard/audit), categorizing them as ‘bridge’.
- Bitcoin addresses were also sourced from user profiles on BitcoinTalk. Forum users often display their personal Bitcoin addresses in their profiles or in signatures below their posts. We scraped the profiles of all forum participants appearing in the previously collected messages, labeling each identified address as ‘individual’.
- We further included addresses tied to mining entities. Miners receive rewards for integrating new transactions into the blockchain through specific transactions known as coinbase transactions. Miners can embed a message in these transactions; certain mining companies include their name or a unique pattern. These messages help to identify these addresses and classify them as ‘mining’.
- Lastly, we enriched our dataset with addresses linked to betting and gambling platforms. Some platforms, like Fairlay and DirectBet, enable customers to place bets by sending bitcoin to an address uniquely designated for each bet. Users often share their bets on forums such as BitcoinTalk via URLs corresponding to the bets, which include Bitcoin deposit addresses. We devised regex patterns to detect these URLs, extract the addresses, and classify them as ‘bet’.
All addresses gathered in this subsection, with the exception of the 29 addresses from DefiLlama, are included in the dataset (see ‘Data Records’). However, these excluded addresses can still be accessed through the provided links.
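The identification of mining entities from coinbase messages, mentioned in the list above, can be sketched as a search for known byte patterns in the input script of the coinbase transaction. The tag table below is purely hypothetical and stands in for whatever patterns are actually used; `pool_from_coinbase` is likewise a hypothetical name.

```python
# Hypothetical tag table: real pools embed their own names or patterns.
POOL_TAGS = {
    b"/ExamplePool/": "ExamplePool",
    b"Mined by ACME": "ACME Mining",
}


def pool_from_coinbase(script_sig: bytes):
    """Return the pool whose tag appears in a coinbase input script, or None."""
    for tag, pool in POOL_TAGS.items():
        # Search the raw bytes: coinbase scripts mix binary data and text.
        if tag in script_sig:
            return pool
    return None


if __name__ == "__main__":
    script = bytes.fromhex("03a08601") + b"/ExamplePool/" + bytes(4)
    print(pool_from_coinbase(script))
```

In this sketch, the addresses receiving the outputs of a coinbase transaction with a recognized tag would then be labeled ‘mining’.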