Complex Backtesting in Python – Part II – Zipline Data Bundles

In the last article on Python backtesting, we looked at how to install the Zipline library and get a basic simulation going. What we did not touch upon was how to hook up your own data. If you are reading this, there is a good chance that you take your backtesting and trading simulation quite seriously, and in that case you probably have your own preferred data source and your own local data storage. Today we’re taking a closer look at how to feed Zipline with your own data. Stocks and futures work a little differently, so in this article we’ll start by getting stock data going.

Article Assumptions

For this article, I’ll assume that you have already completed the installation and setup of Zipline from the previous article. I’ll also assume that you have your data in a local MySQL database. If your data is stored in another way, it should be very easy to modify my code to read from your source. Whether that is a different type of database, CSV flat files or something else, Pandas is your friend. (Thanks, Wes)

Keep in mind that I am not attempting to show you the most efficient, elegant or best-practice methods. I’m attempting to teach you the basics and to get you up and running. In my view, pragmatism is king. Get things working first, gold-plate them later.

I’m using Windows for all examples here. Windows is generally frowned upon by the Python community, in a way that reminds me very much of childish Unix and Mac snobbery from the early ’90s. Well, Windows is by far the most useful platform to be working with if you’re in finance. It is the standard for anything close to front office. If you really feel like working on other operating systems, feel free. What’s described here should generally work on Macs and various Linux-based systems, though some details may differ slightly.

Python Libraries

In the last article, we installed a few Python libraries. Those are required here again, and I’ll assume you still have them installed. Further, since I’m going to read data from a MySQL db, we’ll need to install database support. Two libraries should be needed for this: SQLAlchemy and the MySQL connector.
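
Both can be installed from an Anaconda Prompt with the zip35 environment activated. Something like the line below should do it; mysql-connector-python is the name the connector goes by on PyPI, though package names can drift over time:

    pip install sqlalchemy mysql-connector-python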

Database Structure

I’ll use a simplistic but workable database structure. Again, feel free to use whatever fits your needs. If you don’t yet have a database structure, use something like this first and then build upon it.

My sandbox database is called ‘ambrosia’. That’s just the local name for it, and we’ll need to know that to refer to it later on. Both username and password are set to ‘root’. You may want to use different values here.

In this database, there’s a table called equity_history. It contains daily time series data for a few thousand American stocks, all adjusted for splits and corporate actions. There’s a column for cash dividends as well, so that we can take that into account.
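
As a quick sanity check, here is one way to pull a single stock’s series from such a table with pandas and SQLAlchemy. This is a sketch under assumptions: the column names (ticker, trade_date, open, high, low, close, volume, dividend) are my guesses at a reasonable schema, so substitute your own:

    import pandas as pd
    from sqlalchemy import create_engine

    # Connection string for the sandbox setup described above: database
    # 'ambrosia', username and password both 'root'.
    engine = create_engine('mysql+mysqlconnector://root:root@localhost/ambrosia')

    # Table and column names below are assumptions; adapt to your schema.
    query = """SELECT trade_date, open, high, low, close, volume, dividend
               FROM equity_history WHERE ticker = 'AAPL' ORDER BY trade_date"""
    df = pd.read_sql(query, engine, index_col='trade_date', parse_dates=['trade_date'])
    print(df.tail())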

Also, I have added columns to indicate whether the stock was part of an index on any given day. Sure, this can be done more elegantly, but there’s a reason for keeping it simplistic for now. Strictly speaking, we don’t need these index membership fields for this particular article. I’ve added them because I intend to use them in a later article. If you don’t have this data, don’t sweat it. We’re not going to use it here.

If you lack the dividend data, that’s not great but you can still proceed. If this data is missing from your data source, your results may not be entirely realistic. But that’s a general problem, no matter what backtesting environment you work with.

Zipline Logic and Terminology

Some backtesters let you simply point to a data source and fire off a simulation. That can be very convenient and is generally much easier, but the problem with such an approach is memory management: it does not scale. If you have a large amount of data, it can easily consume more memory than you have available, even on a very powerful workstation. Zipline takes a different approach, reading data incrementally so that it can cater for vast data sets.

It’s more complicated to set up, but it scales a whole lot better. Consider a few thousand stocks with minute-level data going back a few decades. At roughly 390 trading minutes per day and 252 trading days per year, 3,000 stocks over 20 years come to nearly six billion bars. It simply wouldn’t be possible to keep that much data in memory.

With Zipline, you have to get familiar with the twin concepts of ‘bundles’ and ‘ingest’. In order to use data from any given data source, you first need to use, or write, a bundle which can read, process and store this data in Zipline’s own format. The term ingest refers to the execution of the bundle: running it, fetching the data and storing it.

Before you can run a backtest, you need to ingest a bundle.

There are a couple of useful bundles included (bundled?) with Zipline. In the last article, we used the included bundle quantopian-quandl to get free data from Quandl. But in this article, we’ll write our own.

Here are the steps we need to take before we can use our data:

  • Write the bundle.
  • Register the bundle.
  • Ingest the bundle.

Another useful Zipline concept that you should be aware of is ‘sid’, short for security ID: a unique numerical reference to a specific instrument. This is an internal Zipline reference, and as long as there are no repeat sid values, your securities can have any numbers you like. For our bundle, we’ll simply sort all stock tickers alphabetically, start numbering from 0 and keep increasing the sid number for each stock.
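
In code, that numbering scheme is little more than a sort and an enumeration. A minimal sketch, where the hard-coded ticker list stands in for the database query we’ll write later:

    tickers = ['MSFT', 'AAPL', 'XOM', 'IBM']  # stand-in for the real ticker query

    # Sort alphabetically and number from 0; the position becomes the sid.
    for sid, ticker in enumerate(sorted(tickers)):
        print(sid, ticker)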

Writing the Bundle

The bundle is a .py file, which should be located in the Zipline bundles directory. Where exactly that directory is depends on your computer setup and installation. On my development machine, it’s under C:\ProgramData\Anaconda3\envs\zip35\Lib\site-packages\zipline\data\bundles. It may also be under /users/username/appdata/anaconda3/envs/zip35/Lib/… depending on your local installation.

I’m going to use an application called Spyder, which was installed automatically along with the Anaconda package, to write the bundle. Feel free to use any editor you like. You could do this in Notepad, if you really want to.

What I want this bundle to do is the following:

  • Ask the database for a list of all stocks in the history table. (sure, you may want to have a separate meta table later on, but let’s keep it simple for now)
  • For each stock, get the time series of date, open, high, low, close, volume and dividends from the database.
  • Structure the data in the way Zipline likes it, including giving a unique sid number to each stock.
  • Return the data to the Zipline data writer.

Calendar Issue

Zipline is a bit obsessed with calendars. Overly so, for my taste. Or perhaps I just haven’t figured out yet how to get it to chill out on calendar checks.

Zipline assumes that all stocks in a bundle trade on the same exchange, with the same market holidays, and that each stock had trades on every possible trading day. It is so obsessed with this that the process of ingesting (reading) data will crash if this assumption does not hold up.

This could be a good thing, but not always. If your data source has data on a weekend, or a public holiday, the ingest will crash. Fine. But if you have stocks on two exchanges with different holidays, you have an issue. But much more annoying, it will also crash if a stock simply didn’t trade a certain day. Now if you get data for a vast number of stocks, and far back in time, you will encounter this. Some stocks were not liquid enough a decade or two ago, and had days without trades. Kaboom, there goes your ingest. Unless you pad your data first.

Make sure your time series have data for all valid trading days between the first and the last day available for each stock. If you do not, Zipline’s ingest process will rudely remind you by crashing and complaining about exactly which days are missing.
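
One workable padding approach is to forward-fill prices and set volume to zero on the missing days. A minimal pandas sketch, where the hard-coded sessions index stands in for the exchange calendar’s valid trading days:

    import pandas as pd

    # 'sessions' stands in for the calendar's valid trading days between the
    # first and last date of the series.
    sessions = pd.DatetimeIndex(['2018-01-02', '2018-01-03',
                                 '2018-01-04', '2018-01-05'])

    df = pd.DataFrame(
        {'open': [10.0, 10.2, 10.1], 'high': [10.5, 10.4, 10.3],
         'low': [9.8, 10.0, 9.9], 'close': [10.2, 10.1, 10.0],
         'volume': [1000, 1200, 900]},
        index=pd.DatetimeIndex(['2018-01-02', '2018-01-04', '2018-01-05']))  # no trades on Jan 3rd

    df = df.reindex(sessions)  # insert the missing session as a NaN row
    price_cols = ['open', 'high', 'low', 'close']
    df[price_cols] = df[price_cols].ffill()  # carry the last price forward
    df['volume'] = df['volume'].fillna(0)    # no trades means zero volume

Whether zero volume or some other padding is appropriate depends on how your simulation logic treats illiquid days; the point is simply that every valid session needs a row.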

If anyone reading this has figured out a better way to deal with the quite annoying calendar issue, please let us know in the comments.

The Actual Bundle

To construct this bundle, I started off from the included CSV file reader bundle, which was installed with Zipline. I modified it to get data from a db, removed some things not needed, removed support for minute data (for now) and tried to make the code a little more readable. This code should be in a file called ac_db.py, stored in the data/bundles folder already on your computer. The exact path to your bundles folder can vary with installation settings, operating system etc. Search and ye shall find.

The name ac_db.py is of course arbitrary, but I will soon refer to it again. Feel free to call it anything you like, as long as you match it in the bundle registration that we’ll get to shortly.
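
I won’t walk through every line here, but the overall shape is worth showing. Below is a stripped-down sketch of what such a bundle file can look like, not the finished code: the function signature is the one Zipline hands every bundle, while the connection string, table and column names are my assumptions carried over from the database section above. Padding and dividend handling are left as comments.

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed connection details from the database section above.
    engine = create_engine('mysql+mysqlconnector://root:root@localhost/ambrosia')

    def ac_equities_db(environ, asset_db_writer, minute_bar_writer,
                       daily_bar_writer, adjustment_writer, calendar,
                       start_session, end_session, cache, show_progress,
                       output_dir):
        # All tickers, alphabetically; the position in the list becomes the sid.
        tickers = pd.read_sql(
            "SELECT DISTINCT ticker FROM equity_history ORDER BY ticker",
            engine)['ticker']

        # Metadata frame for the asset writer, one row per sid.
        metadata = pd.DataFrame(columns=['start_date', 'end_date',
                                         'auto_close_date', 'symbol', 'exchange'])

        def daily_data():
            for sid, ticker in enumerate(tickers):
                query = ("SELECT trade_date, open, high, low, close, volume "
                         "FROM equity_history WHERE ticker = '{}' "
                         "ORDER BY trade_date".format(ticker))
                df = pd.read_sql(query, engine, index_col='trade_date',
                                 parse_dates=['trade_date'])
                # Pad df to the calendar's sessions here, as discussed above,
                # or the ingest will crash on missing days.
                metadata.loc[sid] = (df.index[0], df.index[-1],
                                     df.index[-1] + pd.Timedelta(days=1),
                                     ticker, 'NYSE')
                yield sid, df

        # The writer consumes the generator one stock at a time, which is
        # what keeps memory consumption down for large data sets.
        daily_bar_writer.write(daily_data(), show_progress=show_progress)
        asset_db_writer.write(equities=metadata)
        # Splits and dividends would be written here; for now, write empty
        # adjustment tables.
        adjustment_writer.write()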

Registering the Bundle

To register the bundle, you need to modify a file called extension.py which is located in your user folder, under .zipline. Again, the exact location can vary with your own installation environment, but on my development machine the path is c:\users\andreas clenow\.zipline\

In this file, we need to tell Zipline that we now have a brand spanking new bundle to use. We need to register the bundle and state which calendar it follows. Again, see the calendar issue above. I’m using the built-in NYSE calendar here, as all the stocks in my development database are American and follow the same holiday calendar.
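
A minimal registration sketch, assuming the file and function names used above (a bundle file ac_db.py exposing ac_equities_db):

    from zipline.data.bundles import register
    from zipline.data.bundles.ac_db import ac_equities_db

    # Register under a name of our choice and tie the bundle to the
    # built-in NYSE calendar.
    register('ac_equities_db', ac_equities_db, calendar_name='NYSE')

The name given to register is the one we’ll refer to when ingesting, in the next step.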

Ingesting the Bundle

All should now be set to ingest the bundle. Remember that the term ingest, in Zipline terminology, refers to the process of reading all the data and storing it in the internal Zipline format. This allows for quick, incremental reading during simulation runs, reducing memory consumption and making vast data sets practical.

To ingest the data, we need to open a console with the correct zip35 environment enabled. Simplest is to open the Anaconda Prompt, which was installed with Anaconda, activate the environment and run the ingest:

    activate zip35
    zipline ingest -b ac_equities_db

If all worked well, you should now see a progress bar moving as your data is being ingested. Depending on the amount of data in your db, it may take a while.

After this is done, you will be able to use this data for simulations. But the details of that will have to be the subject of the next article.

One comment

  1. Hi Andreas,

I run zipline locally using Mongo with the Arctic extensions (courtesy of ManAHL) on non-US stocks – this example is the FTSE350.

1) I had to modify exchange_calendar_lse.py to take into account royal weddings, deaths etc. by adding this code:
    Royals = [
        Timestamp('2002-06-03', tz='UTC'),
        Timestamp('2002-06-04', tz='UTC'),
        Timestamp('2011-04-29', tz='UTC'),
        Timestamp('2012-06-04', tz='UTC'),
        Timestamp('2012-06-05', tz='UTC'),
    ]

    @property
    def adhoc_holidays(self):
        return list(chain(
            Royals,
        ))

2) In zipline/bundles, I added a new file quandl_ukeq_mongo.py which essentially reads data from Mongo and loads it into the zipline db, similar to your approach. I have another script which loads data from Quandl into Mongo. Worth noting that as UK stocks trade in pennies, I divide the price by 100, otherwise it messes up the cash balance in zipline, which assumes it’s dollars (or pounds).

3) I added the following to extension.py, which reads the symbols I wish to load directly from mongo/Arctic:

    from zipline.data.bundles import register
    from zipline.data.bundles.quandl_ukeq_mongo import viamongo
    from zipline.utils.calendars import get_calendar
    from zipline.utils.calendars import exchange_calendar_lse
    from arctic import Arctic

    #eqSym = {}
    store = Arctic("localhost")
    library = store['FTSE350']
    eqSym = library.read('FTSE350_INDEX').data

    register(
        'FTSE_350',  # name this whatever you like
        viamongo(eqSym.index),
        calendar_name='LSE',
    )

4) Finally, the actual backtest code itself, which I run in a Jupyter notebook:

    if __name__ == '__main__':

        #CAPITAL_BASE = 1.0e6
        CAPITAL_BASE = 10000
        ONE_THIRD = 1.0 / 3.0

        start = makeTS("2016-12-22"); end = makeTS("2018-03-09")  # this can go anywhere before the TradingAlgorithm

        load_benchmarks('ISF')

        # load the bundle
        bundle_data = load('FTSE_350', os.environ, None)
        cal = bundle_data.equity_daily_bar_reader.trading_calendar.all_sessions
        pipeline_loader = USEquityPricingLoader(bundle_data.equity_daily_bar_reader,
                                                bundle_data.adjustment_reader)
        choose_loader = make_choose_loader(pipeline_loader)
        env = TradingEnvironment(bm_symbol='^FTSE', exchange_tz='Europe/London',
                                 trading_calendar=get_calendar("LSE"),
                                 asset_db_path=parse_sqlite_connstr(bundle_data.asset_finder.engine.url))

        data = DataPortal(
            env.asset_finder, get_calendar("LSE"),
            first_trading_day=bundle_data.equity_minute_bar_reader.first_trading_day,
            equity_minute_reader=bundle_data.equity_minute_bar_reader,
            equity_daily_reader=bundle_data.equity_daily_bar_reader,
            adjustment_reader=bundle_data.adjustment_reader,
        )

        # the actual running of the backtest happens in the TradingAlgorithm object
        bt_start = time()
        perf = TradingAlgorithm(
            env=env,
            get_pipeline_loader=choose_loader,
            trading_calendar=get_calendar("LSE"),
            sim_params=create_simulation_parameters(
                start=start,
                end=end,
                capital_base=CAPITAL_BASE,
                trading_calendar=get_calendar("LSE"),
                data_frequency='daily'
            ),
            **{
                'initialize': initialize,
                'handle_data': handle_data,
                'before_trading_start': None,
                'analyze': None,
            }
        ).run(data, overwrite_sim_params=False)
        bt_end = time()
        pd.options.display.float_format = '{:.2f}'.format
        print(perf.columns)
        print(perf[['portfolio_value', 'sharpe', 'max_drawdown']])
        #print(perf['sharpe'].groupby([perf.index.year]).mean())
        #print(perf['sortino'].groupby([perf.index.year]).mean())
        #print(perf['benchmark_period_return'].groupby([perf.index.year]).last())
        #print(np.sqrt(perf['benchmark_volatility'].groupby([perf.index.year]).sum()))
        print("The backtest took %0.2f seconds to run." % (bt_end - bt_start))

5) I’m certain this could be significantly improved, but it works for my purposes. I’m currently focussing on how I can use alternative economic data within the backtest.

6) One last word, not to do with zipline: Mongo/Arctic is a joy to use as it’s basically all pandas. This example is on the sector ETFs, but I also use the same technique for the S&P 1500, and it takes a few seconds to run.

    from arctic import Arctic

    store = Arctic("localhost")
    library = store['ETF']
    #store.list_libraries()
    sectors_dict = {
        'SPY': 'S&P500 Benchmark index fund',
        'XLY': 'XLY Consumer Discretionary SPDR Fund',
        'XLF': 'XLF Financial SPDR Fund',
        'XLK': 'XLK Technology SPDR Fund',
        'XLE': 'XLE Energy SPDR Fund',
        'XLV': 'XLV Health Care SPDR Fund',
        'XLI': 'XLI Industrial SPDR Fund',
        'XLP': 'XLP Consumer Staples SPDR Fund',
        'XLB': 'XLB Materials SPDR Fund',
        'XLU': 'XLU Utilities SPDR Fund',
        'XLRE': 'XLRE Real Estate',
    }
    data = [library.read(k).data['Adj_Close'] for k in sectors_dict.keys()]
    data = pd.concat(data, axis=1)
    data.columns = sectors_dict.keys()

    Thanks Stuart