
Introduction

This tutorial will pull BSC transactions for an address, filter them by date, add contract information, and save the data to a CSV file. This helps with tracking the cost basis of your coins and the different types of transactions. We will use the BscScan API and the CoinGecko API to acquire all of this information. I also recommend comparing the final CSV file this Jupyter Notebook outputs against a DeFi portfolio tracking website such as DeBank, since the output might not be perfect. You can then open the CSV file and add any missing information to finish tracking your transactions. This notebook targets BSC, but Etherscan and PolygonScan have very similar endpoints that can be used instead. Be sure to create accounts for both of the APIs mentioned above, as you need an API key to pull data.

**Disclaimer: I do not provide personal investment advice and I am not a qualified licensed investment advisor. The information provided may include errors or inaccuracies. Conduct your own due diligence, or consult a licensed financial advisor or broker before making any and all investment decisions. Any investments, trades, speculations, or decisions made on the basis of any information found on this site and/or script, expressed or implied herein, are committed at your own risk, financial or otherwise. No representations or warranties are made with respect to the accuracy or completeness of the content of this entire site and/or script, including any links to other sites and/or scripts. The content of this site and/or script is for informational purposes only and is of general nature. You bear all risks associated with the use of the site and/or script and content, including without limitation, any reliance on the accuracy, completeness or usefulness of any content available on the site and/or script. Use at your own risk.**

Loading Libraries

This cell loads the libraries necessary to run this notebook. We use the requests library to communicate with the BscScan and CoinGecko APIs, and the json library to pretty-print the data we pull so its structure is easy to read. The datetime library formats date data, pandas creates the Pandas Dataframes containing all the data we pull, numpy manipulates some of the data, time creates a waiting period between API requests since the APIs are rate limited, yaml safely loads the wallet address of interest and the API key, and os creates a directory for the selected transaction year.

In [ ]:
import requests
import json
from datetime import datetime
import pandas as pd
import numpy as np
import time
import yaml
import os
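
Both APIs rate limit free keys, so it can help to wrap requests.get in a small helper that sleeps after every call and retries on failure. This is a minimal sketch and not part of the original notebook; the helper name and retry policy are my own.

In [ ]:
def polite_get(url, wait_s=5, retries=1):
    '''GET a URL, sleeping afterwards so we stay under the APIs rate limits.'''
    for attempt in range(retries + 1):
        response = requests.get(url)
        time.sleep(wait_s) # pause after every call so the API doesn't block our requests
        if response.status_code == 200:
            return response
    return response # hand the caller the last response to inspect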

YAML and Dates

Create a YAML config file named config.yml with keys for your address and apikey in the format seen below. We safely load config.yml and select a year, which determines the start_date (January 1 of that year) and end_date (January 1 of the next year) used to filter the transactions. A directory is then created for that year to store the multiple CSV files we will create throughout this notebook.

address: '0x...' # must be lowercase
apikey: 'ABC..'
In [ ]:
with open('config.yml') as f: # Load yaml file that has API Key and Address
    config = yaml.safe_load(f)
address = config['address'] # Address key from yaml file
apikey = config['apikey'] # API key from yaml file

# The free API tiers are rate limited (roughly 50 calls/minute), so wait between calls with a margin of safety
wait = 5 # in seconds

# Date selection
year = '2022'
start_date = f'{year}-01-01'
end_date = f'{str(int(year)+1)}-01-01'

# Create directory for selected year
if not os.path.exists(year):
    os.makedirs(year)
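
The comment in the config example above says the address must be lowercase; that matters because later cells compare the address against the from and to columns as plain strings. A one-line sanity check (my own addition, not in the original notebook) catches a mixed-case paste early:

In [ ]:
# String comparisons against the API's from/to columns assume a lowercase address
assert address == address.lower(), 'address in config.yml must be lowercase'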

BEP-20 Token Transfer Events

Gathering Data: Transactions

Here we will use the Get a list of 'BEP-20 Token Transfer Events' by Address endpoint to acquire ALL the BEP-20 token transfers made by an address (we will filter them by date in the next cell). The data is already in the JSON format so it can be turned into a Pandas Dataframe right away.

In [ ]:
# BEP-20 Token Transfer Events by Address endpoint
url=f'https://api.bscscan.com/api?module=account&action=tokentx&address={address}&startblock=0&endblock=999999999&sort=desc&apikey={apikey}'

# API Response with data
response = requests.get(url)
response_bep20 = response.json()

# Uncomment to see data format
# print(json.dumps(response_bep20, indent=4))
# print(json.dumps(response_bep20['result'][0], indent=4))

# Transactions are already in JSON format so they can be passed into a Pandas Dataframe directly
bep20_txs = pd.DataFrame(response_bep20['result'])
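
One caveat: Etherscan-family endpoints like this one cap how many records a single query returns (the documented limit is on the order of 10,000). For a wallet busier than that, the endpoint also accepts page and offset parameters. The paging loop below is a hedged sketch under that assumption; verify the exact limits against the BscScan docs before relying on it.

In [ ]:
# Hypothetical paging sketch for very active wallets
all_results = []
page = 1
while True:
    paged_url = (f'https://api.bscscan.com/api?module=account&action=tokentx&address={address}'
                 f'&page={page}&offset=1000&sort=desc&apikey={apikey}')
    batch = requests.get(paged_url).json().get('result', [])
    if not isinstance(batch, list) or not batch:
        break # no more transfers (or an error string in 'result')
    all_results.extend(batch)
    page += 1
    time.sleep(wait) # respect the rate limit
# bep20_txs = pd.DataFrame(all_results)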

Cleaning Up: Filter by Date and Show Tokens

Here we create a few additional columns to make the data more human readable (see the list below) and then filter the transactions by the date bounds provided above (start_date and end_date). Some high-level transaction information is printed for a quick overview, but the more important piece is the All tokens printout, which is used to filter out spam tokens in the next cell.

  • The dateTime column converts the Unix timeStamp into an actual date and time to make it human readable.
  • The valueReal column converts the raw value from wei to ether units, which is the actual number of tokens in the transaction, to make it human readable (here is a converter).
  • The gasFeeInBNB column calculates the gas fee in BNB from the gasPrice (in wei) and gasUsed columns. The other gas columns in the Pandas Dataframe aren't used since the gas column is the gas limit that could have been used for this transaction and the cumulativeGasUsed column is the gas used by the whole block up to this transaction.
In [ ]:
# Creation of more human readable columns
bep20_txs['dateTime']=pd.to_datetime(bep20_txs['timeStamp'], unit='s')
bep20_txs['valueReal']=bep20_txs['value'].astype(float)*10**-18
bep20_txs['gasFeeInBNB']=bep20_txs['gasPrice'].astype(float)*10**-18*bep20_txs['gasUsed'].astype(float)

# Create a mask to filter out desired dates
mask_date=(bep20_txs['dateTime']>start_date) & (bep20_txs['dateTime']<end_date)
bep20_txs=bep20_txs.loc[mask_date]

# Transaction Information
print('---------- Transaction Dates ----------')
print('Oldest Transaction: ',bep20_txs.iloc[-1,bep20_txs.columns.get_loc('dateTime')])
print('Latest Transaction: ',bep20_txs.iloc[0,bep20_txs.columns.get_loc('dateTime')])

print('---------- Total Transactions ----------')
print(bep20_txs.shape[0])

print('---------- All tokens ----------')
print(bep20_txs['tokenName'].unique()) # Use this to filter out spam tokens in the next cell.                         
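
A small caveat on valueReal: a Python float carries only 15-17 significant digits, so multiplying an 18-decimal raw amount by 10**-18 can round off the trailing digits. That is usually negligible for cost-basis tracking, but if exact amounts matter, decimal.Decimal avoids the rounding. A minimal sketch with a made-up raw value:

In [ ]:
from decimal import Decimal

raw = '123456789012345678901' # raw token amount in wei-style units (hypothetical)
print(float(raw) * 10**-18) # float rounds the trailing digits
print(Decimal(raw) / Decimal(10**18)) # exact: 123.456789012345678901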

Cleaning Up: Filter Out Spam Tokens

Now that we have a list of all the tokens in our transactions, we can filter out spam by adding the suspect tokens to the spamtokens list below. Be careful, as some of these token names can differ slightly depending on the transaction type, such as LP tokens. We then save the new Pandas Dataframe as a CSV file, something we will do throughout this notebook since the API calls can take some time and we don't want to call the APIs repeatedly to recreate the Pandas Dataframes.

In [ ]:
spamtokens=['Def8.io', 'Dex88.org', 'AGMC.io', 'Swap7.org', 'Tu7.org', 'LinkP.io', 'Zepe.io',
            'BestAir.io', 'bonuswallet.org', 'BSCTOKEN.IO', 'TheEver.io', 'ALPACAFIN.COM',
            'TheVera.io', 'SAFEMOON.is', 'HOTGraph Token', 'Minereum BSC', 'Wis.com', 'Tadpole',
            'Godzilla', 'Doge DeFi', 'BUTTON', 'CTRS', 'ANY Ethereum', 'Monki', 'ONXswap.com',
            'Wall Street Games', 'APPLEB', 'TokenPocket Token', 'luna2.app', 'SpacePi Token',
            'Crypto Gold Box', '0Apply Rich Token', '1stake.io', 'GGBoxs.com', 'AgileSwap.io',
            'Gambling-Crypto.games', '1Gas.org', 'bitman.trade app token']

# Create a mask to filter out unwanted spam tokens
mask_spam=(~bep20_txs['tokenName'].isin(spamtokens))
bep20_txs=bep20_txs.loc[mask_spam]

# Transaction Information without spam tokens
print('---------- Total Transactions after Filter ----------')
print(bep20_txs.shape[0])

print('---------- Filtered tokens ----------')
print(bep20_txs['tokenName'].unique())

# Drop previous index now that spam tokens have been filtered
bep20_txs=bep20_txs.reset_index(drop=True)

# Save new Pandas Dataframe to a csv file
bep20_txs.to_csv(os.path.join(year,f'{year}_bep20_txs.csv'))

Contract Information

The next step is to gather more information on the different contracts using the contractAddress column, primarily by calling the CoinGecko Contract endpoint, since it provides much more information than the BscScan Get Contract Source Code for Verified Contract Source Codes endpoint. However, we still use the BscScan endpoint as a backup in case the CoinGecko endpoint doesn't return anything. This information provides details such as the contract's name, symbol, homepage, Twitter, etc., to get a better feel for its legitimacy, and will later be used to acquire the price of that coin as well.

Gathering Data: Contracts

We first extract the unique contract addresses from the bep20_txs Pandas Dataframe created previously, then call the contract endpoints in a loop over these unique contracts to gather their information. Here we use the Python dictionary method dictionary.get(key, default), which returns the default if the key doesn't exist rather than raising a KeyError like dictionary[key] would. For nested lookups we pass an empty dictionary {} as the default so that a chained .get() still works when the outer key is missing. This is useful because some of the information isn't available for every contract address, and we don't want to error out during the endpoint loop since the data is only saved after the loop finishes.

The CoinGecko endpoint is called first but if the response status code isn't 200 (200 meaning successful response), then the BscScan endpoint is called to see if it might have some information on the contract. A temporary dictionary is created for each contract and is appended to the contracts Pandas DataFrame which is then saved to a separate csv file. Note how we created a key titled contractAddress (which will be turned into a column in the dataframe) since we will use this to merge with the dataframe we created earlier which has a column by the same name.
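
As a quick illustration of why the defaults matter when chaining, note that the outer default must itself be a dictionary, otherwise the second .get() is called on a string and fails. A tiny made-up example:

In [ ]:
contract_info = {} # pretend CoinGecko returned nothing useful
print(contract_info.get('links', {}).get('homepage', [])) # -> [] (safe)
# contract_info.get('links', 'None').get('homepage') # AttributeError: 'str' object has no attribute 'get'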

In [ ]:
# Filter out unique contract addresses
unique_contracts=bep20_txs['contractAddress'].unique()
print('---------- Unique Contract Addresses ----------')
print(len(unique_contracts))

d={} # Temporary dictionary
contracts=pd.DataFrame() # Pandas Dataframe for the contracts

# Loop over the unique contract addresses
for i, contract in enumerate(unique_contracts):

    print(f'Contract {i+1} out of {len(unique_contracts)}')

    # API Endpoint
    url=f'https://api.coingecko.com/api/v3/coins/binance-smart-chain/contract/{contract}'
    response = requests.get(url)
    contract_info = response.json()
    
    # Uncomment to see data format
    # print(json.dumps(contract_info, indent=4))

    # Successful response from CoinGecko
    if response.status_code==200: 

        # Fill in temporary dictionary with contract information
        d={'contractAddress':contract, # name them the same for later merge
            'contract_id':contract_info.get('id','None'),
            'contract_symbol':contract_info.get('symbol','None'),
            'contract_name':contract_info.get('name','None'),
            'contract_description':contract_info.get('description',{}).get('en','None'), # default {} so the chained .get() can't fail
            'contract_homepage':contract_info.get('links',{}).get('homepage',[])[0:1], # slicing bypasses error if list is empty
            'contract_blockchain_site':contract_info.get('links',{}).get('blockchain_site',[])[0:1],
            'contract_twitter_screen_name':'twitter.com/'+str(contract_info.get('links',{}).get('twitter_screen_name','None')), # str() in case the API returns null
            'contract_github':contract_info.get('links',{}).get('repos_url',{}).get('github',[])[0:1],
            'contract_image':contract_info.get('image',{}).get('thumb','None'),
            'contract_country_origin':contract_info.get('country_origin','None'),
          }

    # Unsuccessful response from CoinGecko so call BSC API
    else:

        # API Endpoint
        url=f'https://api.bscscan.com/api?module=contract&action=getsourcecode&address={contract}&apikey={apikey}'
        response = requests.get(url)
        contract_info = response.json()

        # Uncomment to see data format
        # print(json.dumps(contract_info, indent=4))

        # Fill in temporary dictionary with contract information, missing columns will be NaN
        d={'contractAddress':contract, # name them the same for later merge
            'contract_name':contract_info.get('result',[{}])[0].get('ContractName','None'), # default [{}] so the chained lookup can't fail
          }

    # Add temporary dictionary to contracts Pandas Dataframe (DataFrame.append was removed in pandas 2.0)
    contracts=pd.concat([contracts, pd.DataFrame([d])], ignore_index=True)

    time.sleep(wait) # In seconds so API doesn't block our requests

# Save new Pandas Dataframe to a csv file
contracts.to_csv(os.path.join(year,f'{year}_contracts.csv'))

Cleaning Up: Merging

We then merge the two Pandas Dataframes on the contractAddress column to add the contract information from contracts to the transaction Pandas Dataframe bep20_txs. We create a new Pandas Dataframe bep20_txs_contracts and execute the merge with the Pandas Dataframe merge method, specifying the Pandas Dataframes and their columns of interest.

The left Pandas Dataframe is the one calling the merge method (bep20_txs_contracts.merge()) and the right Pandas Dataframe is the one passed to it (contracts). The left_on and right_on arguments name the join column in each of those two Pandas Dataframes, and it is important to use how='left' since we want to keep every row of the left Pandas Dataframe bep20_txs and just copy over the matching information from the right Pandas Dataframe contracts. This adds the contract information to each row of the transaction Pandas Dataframe.
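
As a toy illustration of the how='left' behavior (the data here is made up): every row of the left Pandas Dataframe survives, and rows without a match simply get NaN for the right-hand columns.

In [ ]:
left = pd.DataFrame({'contractAddress': ['0xaaa', '0xbbb', '0xaaa'], 'value': [1, 2, 3]})
right = pd.DataFrame({'contractAddress': ['0xaaa'], 'contract_name': ['CoinA']})
print(left.merge(right, on='contractAddress', how='left')) # '0xbbb' gets NaN for contract_name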

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
    # make sure to call out index_col so the merge doesn't create extra 0_x, 0_y columns
contracts=pd.read_csv(os.path.join(year,f'{year}_contracts.csv'),index_col=0) 
bep20_txs=pd.read_csv(os.path.join(year,f'{year}_bep20_txs.csv'),index_col=0)

# Create a copy of the Pandas Dataframe for the merge
bep20_txs_contracts=bep20_txs.copy()

# Create the merge in the new dataframe
bep20_txs_contracts=bep20_txs_contracts.merge(contracts, left_on='contractAddress', right_on='contractAddress', how='left')

# Save new Pandas Dataframe to a csv file
bep20_txs_contracts.to_csv(os.path.join(year,f'{year}_bep20_txs_contracts.csv')) 

Internal Transactions

Some transactions contain transactions within them, created by how the different contracts interact with one another, so we use the BscScan Get 'Internal Transactions' by Transaction Hash endpoint to determine which transactions have internal transactions. This reveals any coins that would otherwise be hidden inside the internal transactions.

Gathering Data: Internal Transactions

We loop through all the transactions with the Pandas Dataframe .iterrows() method to find any internal transactions for each transaction tx = row['hash'], sum up all the internal transactions' token values in valueCount, and assign the total to the last internal transaction of that original transaction as the column intTxvalueTotal. We also create a temporary index count, since there can be multiple internal transactions per transaction and we are going to merge based on the index in the next section.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
bep20_txs_contracts=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts.csv'))

int_txs=pd.DataFrame() # Pandas Dataframe for the internal transactions

# Temporary index since there could be multiple internal transactions per transaction  
    # need this so the merge doesn't get confused later on since multiple rows for hashes
count=0

# Loop over transactions
for i, row in bep20_txs_contracts.iterrows():

    print(f'Transaction {i+1} out of {bep20_txs_contracts.shape[0]}')

    tx = row['hash'] # get transaction hash

    # API Endpoint
    url=f'https://api.bscscan.com/api?module=account&action=txlistinternal&txhash={tx}&apikey={apikey}'
    response = requests.get(url)
    internal = response.json()

    # If there are no internal transactions
    if internal['message']=='No transactions found':

        # Fill in Pandas Dataframe
        int_txs.loc[count,'intTxoriginalIndex']=int(i)
        int_txs.loc[count,'hash']=tx
        int_txs.loc[count,'intTx']='No'

        print('No Internal Transactions')

        count+=1

    # If there are internal transactions
    elif internal['message']=='OK':

        print('Yes Internal Transactions')

        valueCount=0 # For counting up the total value of internal transactions

        # Loop through all internal transactions for that specific transaction
        for j, result in enumerate(internal['result']): # j so the outer transaction index i isn't shadowed

            print(f'\tInternal Transaction {j+1}')
            print(f'\t{result}')

            # Fill in Pandas Dataframe
            int_txs.loc[count,'intTxoriginalIndex']=int(i) # the outer transaction index so the merge lines up later
            int_txs.loc[count,'hash']=tx
            int_txs.loc[count,'intTx']='Yes'
            int_txs.loc[count,'intTxtimeStamp']=result['timeStamp']
            int_txs.loc[count,'intTxfrom']=result['from']
            int_txs.loc[count,'intTXto']=result['to']
            int_txs.loc[count,'intTxvalue']=result['value']
            count+=1

            # Count up value of internal transactions
            if result['to']== address:
                valueCount=valueCount+float(result['value'])
        
        # Adding total value to last internal transaction of that specific transaction
        int_txs.loc[count-1,'intTxvalueTotal'] = valueCount # -1 since it already got added so don't want to go onto the next one 



    time.sleep(wait) # In seconds so API doesn't block our requests

# Converting columns and filling in na with 0 so it won't complain about na values for maths
int_txs['intTxdateTime']=pd.to_datetime(pd.to_numeric(int_txs['intTxtimeStamp']), unit='s') # timestamps arrive as strings so convert to numeric first
int_txs['intTXvalueReal']=int_txs['intTxvalue'].fillna(0).astype(float)*10**-18
int_txs['intTXvalueTotalreal']=int_txs['intTxvalueTotal'].fillna(0).astype(float)*10**-18

# Save new Pandas Dataframe to a csv file
int_txs.to_csv(os.path.join(year,f'{year}_int_txs.csv'))

Cleaning Up: Merging

Here we look at all the internal transactions for each transaction and select the one whose total value intTXvalueTotalreal is not zero. We take this final internal transaction, which is a single row of the Pandas Dataframe int_txs, and merge it onto the original transaction (Pandas Dataframe bep20_txs_contracts) using the temporary index intTxoriginalIndex, producing a new Pandas Dataframe bep20_txs_contracts_internal.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
    # make sure to call out index_col so the merge doesn't create extra 0_x, 0_y columns
bep20_txs_contracts=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts.csv'), index_col=0)
int_txs=pd.read_csv(os.path.join(year,f'{year}_int_txs.csv'), index_col=0)

# Filter out internal transactions without a total value
int_txs_final=int_txs.loc[int_txs['intTXvalueTotalreal']!=0]
print(int_txs_final[['hash','intTXvalueTotalreal']])

# Drop the hash so the merge doesn't create hash_x and hash_y hence the need for the temporary index
int_txs_final= int_txs_final.drop(columns='hash')

# Create a copy of the Pandas Dataframe for the merge
bep20_txs_contracts_internal=bep20_txs_contracts.copy()

# Create the merge in the new dataframe
bep20_txs_contracts_internal=bep20_txs_contracts_internal.merge(int_txs_final, left_on=bep20_txs_contracts_internal.index, right_on='intTxoriginalIndex', how='left')

# Fill in empty intTx rows since we created a subset of information and excluded all the Nos in the previous lines
bep20_txs_contracts_internal.loc[pd.isna(bep20_txs_contracts_internal['intTx']),'intTx']='No'

# Save new Pandas Dataframe to a csv file
bep20_txs_contracts_internal.to_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal.csv'))

Trade Prices

The final use of the APIs is to gather price information for each coin and BNB for each transaction. We use the closest date to dateTime as the price estimation and create extra columns to manually check the price in case the date chosen is too far off the transaction date. We use the CoinGecko Coin Market Chart Range endpoint to gather all this information.

Gather Data

The free API price history has a resolution of one day, so the date chosen should never be off by more than one day. We loop through the transactions and use the endpoint to find the price of the contract_id. If the contract_id is found and the endpoint returns a list of prices with corresponding dates, we calculate which date is closest to the trade date dateTime and select that price as contract_id_price. If the contract_id cannot be found, or the list of prices is empty, we fill in contract_id_price with 'Coin not found' or 'Prices not found' respectively. We then create a few extra columns suffixed with _TRUE_CHECK to manually check the prices in case the dates are too far off. We also create a temporary index tradePricesoriginalIndex to avoid the merge method adding a suffix to the transaction hash in the next cell.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
bep20_txs_contracts_internal=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal.csv'),index_col=0)

trade_prices=pd.DataFrame() # Pandas Dataframe for the trade prices

# Loop through transactions
for i, row in bep20_txs_contracts_internal.iterrows():

    print(f'Transaction {i+1} out of {bep20_txs_contracts_internal.shape[0]}')

    contract_id=row['contract_id'] # The coin for the specific transaction

    trade_prices.loc[i, 'tradePricesoriginalIndex']=int(i) # Temporary index to merge later
    trade_prices.loc[i, 'hash']=row['hash'] # Transaction hash to double check

    # If the contract id exists meaning contract info was acquired
    if pd.notna(contract_id):

        print('Contract found')

        # Acquiring time and creating timeframes to find closest data point
        tradedate=pd.to_datetime(row['dateTime']).timestamp()
        start=tradedate-1*60*60 # Only need one hour
        end=tradedate+1*60*60 # Only need one hour

        print('Trade Date: ',row['dateTime'])
        print('Start Date: ',pd.to_datetime(start, unit='s'))
        print('End Date: ',pd.to_datetime(end, unit='s'))

        # API Endpoint
        url=f'https://api.coingecko.com/api/v3/coins/{contract_id}/market_chart/range?vs_currency=usd&from={str(int(start))}&to={str(int(end))}'
        response = requests.get(url)
        prices = response.json()

        # If there is NO error key then continue
        if prices.get('error','CONTINUE')=='CONTINUE':

            # If prices is not empty
            if prices['prices']:
                
                # Grab prices array and create a temporary mini Pandas Dataframe
                date_prices=np.array(prices['prices'])
                d={'timeStamp':date_prices[:,0],'price':date_prices[:,1]}
                date_prices_df=pd.DataFrame(data=d)
                date_prices_df['dateTime']=pd.to_datetime(date_prices_df['timeStamp'],unit='ms')
                
                # Find which date is closest to the transaction and use that corresponding price
                diff=abs(date_prices_df['timeStamp']/1000-tradedate) # This timeStamp is in ms so convert to seconds by dividing by 1000
                min_diff=diff.idxmin()
                closest=date_prices_df.iloc[min_diff]['dateTime']
                closestPrice=date_prices_df.iloc[min_diff]['price']
                min_diff_sec=diff[min_diff]

                print('Closest Trade: ', closest, ' Price: ',closestPrice ,' Minutes Off',min_diff_sec/60)

                trade_prices.loc[i, 'contract_id_price'] =closestPrice
                trade_prices.loc[i, 'contract_id_closest_dateTime'] =closest
                trade_prices.loc[i, 'contract_id_closest_dateTime_diff_(min)'] = min_diff_sec/60

            # If prices is empty
            else:

                trade_prices.loc[i, 'contract_id_price'] = 'Prices not found'
                trade_prices.loc[i, 'contract_id_closest_dateTime'] = 'Prices not found'
                trade_prices.loc[i, 'contract_id_closest_dateTime_diff_(min)'] =  'Prices not found'
        
        # If there is an error key, the coin was not found
        else:

            trade_prices.loc[i, 'contract_id_price'] = 'Coin not found'
            trade_prices.loc[i, 'contract_id_closest_dateTime'] = 'Coin not found'
            trade_prices.loc[i, 'contract_id_closest_dateTime_diff_(min)'] = 'Coin not found'
        
        time.sleep(wait) # In seconds so API doesn't block our requests
        

        ###################
        # BNB Price Check #
        ###################  

        coinid='wbnb'

        # API Endpoint
        url=f'https://api.coingecko.com/api/v3/coins/{coinid}/market_chart/range?vs_currency=usd&from={str(int(start))}&to={str(int(end))}'
        response = requests.get(url)
        prices= response.json()

        # Grab prices array and create a temporary mini Pandas Dataframe
        date_prices=np.array(prices['prices'])
        d={'timeStamp':date_prices[:,0],'price':date_prices[:,1]}
        date_prices_df=pd.DataFrame(data=d)
        date_prices_df['dateTime']=pd.to_datetime(date_prices_df['timeStamp'],unit='ms')

        # Find which date is closest to the transaction and use that corresponding price
        diff=abs(date_prices_df['timeStamp']/1000-tradedate) #this timeStamp is in ms so convert to seconds by dividing by 1000
        min_diff=diff.idxmin()
        closest=date_prices_df.iloc[min_diff]['dateTime']
        closestPrice=date_prices_df.iloc[min_diff]['price']
        min_diff_sec=diff[min_diff]

        print('Closest Trade BNB: ', closest, ' Price: ',closestPrice ,' Minutes Off',min_diff_sec/60)

        trade_prices.loc[i, 'bnb_price'] =closestPrice
        trade_prices.loc[i, 'bnb_closest_dateTime'] =closest
        trade_prices.loc[i, 'bnb_closest_dateTime_diff_(min)'] =min_diff_sec/60


        time.sleep(wait) # In seconds so API doesn't block our requests


    # No contract information found
    else:

        print('Contract not found')
        
        continue


# Adding empty columns so prices whose dates are too far off can be manually checked
trade_prices['contract_id_price_TRUE_CHECK']=pd.NaT
trade_prices['contract_id_closest_dateTime_TRUE_CHECK']=pd.NaT
trade_prices['bnb_price_TRUE_CHECK']=pd.NaT
trade_prices['bnb_closest_dateTime_TRUE_CHECK']=pd.NaT

# Save new Pandas Dataframe to a csv file
trade_prices.to_csv(os.path.join(year,f'{year}_trade_prices.csv'))

Cleaning Up: Merging

We merge the two Pandas Dataframes (bep20_txs_contracts_internal and trade_prices) using the indices to create a new Pandas Dataframe (bep20_txs_contracts_internal_prices). We also do some math to calculate the total USD value of the transaction, not just the price of a single coin. We do this with the Pandas Dataframe Dataframe.apply() method, adding logic to handle the 'Prices not found' and 'Coin not found' exceptions, and we apply it with axis=1 since we want row-wise operations.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
bep20_txs_contracts_internal=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal.csv'),index_col=0)
trade_prices=pd.read_csv(os.path.join(year,f'{year}_trade_prices.csv'),index_col=0)

# Drop the hash so the merge doesn't create hash_x and hash_y hence the need for the temporary index
trade_prices = trade_prices.drop(columns='hash')

# Create a copy of the Pandas Dataframe for the merge
bep20_txs_contracts_internal_prices=bep20_txs_contracts_internal.copy()

# Create the merge in the new dataframe
bep20_txs_contracts_internal_prices=bep20_txs_contracts_internal_prices.merge(trade_prices, left_on=bep20_txs_contracts_internal_prices.index, right_on='tradePricesoriginalIndex', how='left')


# New column which calculates the total coin price in USD. Added extra logic to ignore the `Prices not found` and `Coin not found` values
    # axis = 1 since we want to do row-wise operations
bep20_txs_contracts_internal_prices['contract_id_total_usd'] = bep20_txs_contracts_internal_prices.apply( \
    lambda x: x['valueReal']*float(x['contract_id_price']) \
    if x['contract_id_price'] != 'Prices not found' and x['contract_id_price'] != 'Coin not found'  \
    else x['contract_id_price'], axis = 1) 

# Calculate Gas Fee in USD                                    
bep20_txs_contracts_internal_prices['gasFeeInUSD']=bep20_txs_contracts_internal_prices['gasFeeInBNB']*bep20_txs_contracts_internal_prices['bnb_price']

# Adding empty columns so values that are too far off can be manually checked
bep20_txs_contracts_internal_prices['contract_id_total_usd_TRUE_CHECK']=pd.NaT
bep20_txs_contracts_internal_prices['gasFeeInUSD_TRUE_CHECK']=pd.NaT

# Save new Pandas Dataframe to a csv file
bep20_txs_contracts_internal_prices.to_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal_prices.csv'))

Transaction Type

This cell does not call any APIs; instead it creates two columns based on the addresses the transactions occurred from and the number of transactions sharing the same hash. The type column does a simple check to see if the wallet address is in the from column and fills in 'Sent' or 'Received' using the np.where() method. The first argument to np.where() is the condition, the second is the value if the condition is True, and the third is the value if it is False. These calls can also be nested, as in np.where(cond1, value1, np.where(cond2, value2, value3)).
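
A tiny made-up example of that chaining, labeling hash counts the way this cell does:

In [ ]:
n = pd.Series([1, 2, 3]) # pretend these are counts of rows sharing a hash
print(np.where(n == 1, 'single', np.where(n == 2, 'double', 'triple'))) # ['single' 'double' 'triple']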

The next portion of this cell creates a more detailed description of the transaction type, typeDetail, which also checks how many transactions share the same hash. If only one transaction has that hash, coins were either sent out (the user sent them) or transferred in (from another wallet); the latter has a high probability of being spam, but hopefully most were filtered out with the spamtokens list at the beginning of this Jupyter Notebook. If two transactions share the hash, that usually means an exchange from one token to another, or a double reward. If three transactions share the hash, that usually means providing liquidity to an LP or removing liquidity from an LP. There are surely more possibilities, but these are the ones I found most common. The new Pandas Dataframe is then saved as bep20_txs_contracts_internal_prices_type.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
bep20_txs_contracts_internal_prices=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal_prices.csv'))

# Create a copy of the Pandas Dataframe for more data manipulation
bep20_txs_contracts_internal_prices_type=bep20_txs_contracts_internal_prices.copy()

# Fill in the high level transaction type with either Sent or Received
bep20_txs_contracts_internal_prices_type['type']=np.where(bep20_txs_contracts_internal_prices_type['from']==address,'Sent','Received')

# Loop through transactions
for i, row in bep20_txs_contracts_internal_prices_type.iterrows():

    # Find all transactions with the same hash as these will be grouped together
        # Doesn't matter if it searches again in the next loop since idxs will remain the same
    mask_hash=(bep20_txs_contracts_internal_prices_type['hash']==row['hash'])
    idxs=bep20_txs_contracts_internal_prices_type.index[mask_hash].values

    print('-------------------------------------------------')
    print('INDEX',i,'COUNT',bep20_txs_contracts_internal_prices_type.loc[mask_hash,'hash'].count())
    print(idxs)

    # Only one transaction has the same hash
    if len(idxs)==1:

        print('Single: ',bep20_txs_contracts_internal_prices_type.loc[i,'hash'])

        # Usually sending out coins to another address
        if bep20_txs_contracts_internal_prices_type.loc[i,'from'] == address:

            print('SENT')
            bep20_txs_contracts_internal_prices_type.loc[i,'typeDetail'] ='SENT' # works because the index is an integer; otherwise use the idxs approach below
        
        # Usually receiving coins from another address BUT a lot of these could be spam tokens so watch out
        elif bep20_txs_contracts_internal_prices_type.loc[i,'to'] == address:

            print('RECEIVED')
            bep20_txs_contracts_internal_prices_type.loc[i,'typeDetail'] ='RECEIVED' # works because the index is an integer; otherwise use the idxs approach below

    # Two transactions with the same hash
    elif len(idxs)==2: 

        print('Double: ',bep20_txs_contracts_internal_prices_type.loc[i,'hash'])

        # Usually SOLD
        if bep20_txs_contracts_internal_prices_type.loc[idxs[0],'from'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Exchange I SOLD'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Exchange I BOUGHT with above'

        # Usually BOUGHT
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'from']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Exchange I BOUGHT'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Exchange I SOLD with above'

        # Usually Reward
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'REWARD'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'REWARD with above'

    # Three transactions with the same hash
    elif len(idxs)==3:

        print('Triple: ',bep20_txs_contracts_internal_prices_type.loc[i,'hash'])

        # Usually LP BOUGHT, three separate statements due to the different orders they can show up
        if bep20_txs_contracts_internal_prices_type.loc[idxs[0],'from'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'from']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Coin #1 for LP I Provided with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #2 for LP I Provided'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'LP I BOUGHT with above'
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'from'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'from']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Coin #1 for LP I Provided with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'LP I BOUGHT'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I Provided with above'
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'from']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'from']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'LP I BOUGHT with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #1 for LP I Provided'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I Provided with above'


        # Usually LP SOLD, three separate statements due to the different orders they can show up
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'from']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Coin #1 for LP I Received with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #2 for LP I Received'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'LP I SOLD with above'
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'from']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'Coin #1 for LP I Received with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'LP I SOLD'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I Received with above'
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'from'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'LP I SOLD with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #1 for LP I Received'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I Received with above'

        # SENT EVERYTHING?
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'from'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'from']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'from']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'LP I SENT with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #1 for LP I SENT'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I SENT with above'

        # RECEIVED EVERYTHING?
        elif bep20_txs_contracts_internal_prices_type.loc[idxs[0],'to'] == address and bep20_txs_contracts_internal_prices_type.loc[idxs[1],'to']  == address and bep20_txs_contracts_internal_prices_type.loc[idxs[2],'to']  == address:
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'LP I RECEIVED with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'Coin #1 for LP I RECEIVED'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'Coin #2 for LP I RECEIVED with above'

        # Something else?
        else:

            print('Special',row['hash'])
            bep20_txs_contracts_internal_prices_type.loc[idxs[0],'typeDetail'] = 'TRIPLE with below'
            bep20_txs_contracts_internal_prices_type.loc[idxs[1],'typeDetail'] = 'TRIPLE'
            bep20_txs_contracts_internal_prices_type.loc[idxs[2],'typeDetail'] = 'TRIPLE with above'


# Save new Pandas Dataframe to a csv file
bep20_txs_contracts_internal_prices_type.to_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal_prices_type.csv'))

Final Dataframe

The final cell extracts specific columns from the main Pandas Dataframe, adds new columns, and arranges them in a readable order to create a tracking-spreadsheet dataframe. The columns are rearranged by passing the reordered list of column names columns_final to the Dataframe.loc[:, columns] indexer. New columns are then inserted at specific locations with the Dataframe.insert() method together with the Dataframe.columns.get_loc() method. We then save this final Pandas Dataframe final as a CSV so it can be further manipulated in Excel.

In [ ]:
# Read in the saved csv files so we don't have to start over with the API calls
bep20_txs_contracts_internal_prices_type=pd.read_csv(os.path.join(year,f'{year}_bep20_txs_contracts_internal_prices_type.csv'))

# Extracting and re-organizing columns for better readability
columns_final=['hash','dateTime','type','typeDetail','from','to','tokenName','valueReal','contract_name','contract_id_price','contract_id_price_TRUE_CHECK','contract_id_closest_dateTime','contract_id_closest_dateTime_TRUE_CHECK', 'contract_id_closest_dateTime_diff_(min)','contract_id_total_usd',
'contract_id_total_usd_TRUE_CHECK','gasFeeInBNB','gasFeeInUSD','gasFeeInUSD_TRUE_CHECK','bnb_price','bnb_price_TRUE_CHECK','bnb_closest_dateTime','bnb_closest_dateTime_TRUE_CHECK','bnb_closest_dateTime_diff_(min)','intTx','intTXvalueTotalreal']


# Create a copy of the Pandas Dataframe with the new column order
final=bep20_txs_contracts_internal_prices_type.loc[: ,columns_final].copy()

# Adding columns in specific locations for better readability, inserted +1 from specific column so it goes afterwards
final.insert(0,'Ignore?','')
final.insert(final.columns.get_loc('hash')+1,'Corresponding_Hash','')
final.insert(final.columns.get_loc('contract_name')+1,'For_LPs_or_Staking_Original_Amount','') 
final.insert(final.columns.get_loc('For_LPs_or_Staking_Original_Amount')+1,'For_LPs_or_Staking_Difference','') 
final.insert(final.columns.get_loc('contract_id_total_usd_TRUE_CHECK')+1,'Original_Price','') 
final.insert(final.columns.get_loc('Original_Price')+1,'Original_Date','') 
final.insert(final.columns.get_loc('Original_Date')+1,'Original_Total_USD','')
final.insert(final.columns.get_loc('Original_Total_USD')+1,'PL_USD','')
final.insert(final.columns.get_loc('PL_USD')+1,'Type_Taxes','')

final.to_csv(os.path.join(year,f'{year}_final.csv'))