A small, complete example of the issue

Recently we discovered that streaming loads into Google BigQuery tables are best-effort and can be delayed by up to 90 minutes. They carry no transactional guarantees, so to_gbq() returning is no guarantee that the data is actually queryable yet.

I propose that we switch the loading from tabledata.insertAll() to a batch BigQuery load job, either via an HTTP upload or by staging the data in a Cloud Storage bucket. The latter is less attractive because it also requires bucket permissions, but I'm not sure about the size bounds of the HTTP upload version. A rough sketch of what the load-job path could look like follows.
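For illustration only (this is not how to_gbq() currently works), here's a sketch of a single load job via jobs.insert() with the CSV as the upload body, using the same googleapiclient discovery client as the test script further down. The project, dataset, table, and schema names are placeholders:

import io
import time

import pandas as pd
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from oauth2client.client import GoogleCredentials

project_id = 'sample-gcp-project'          # placeholder names
dataset_id, table_id = 'tmp', 'tmp'

df = pd.DataFrame({'a': range(5), 'b': range(5)})
csv_buf = io.BytesIO(df.to_csv(index=False, header=False).encode('utf-8'))

credentials = GoogleCredentials.get_application_default()
bq = build('bigquery', 'v2', credentials=credentials)

job_body = {
    'configuration': {
        'load': {
            'destinationTable': {'projectId': project_id,
                                 'datasetId': dataset_id,
                                 'tableId': table_id},
            'sourceFormat': 'CSV',
            'writeDisposition': 'WRITE_TRUNCATE',
            'schema': {'fields': [{'name': 'a', 'type': 'INTEGER'},
                                  {'name': 'b', 'type': 'INTEGER'}]},
        }
    }
}

# jobs.insert() accepts the file contents directly as the upload body,
# so no intermediate Cloud Storage bucket is needed for this variant.
job = bq.jobs().insert(
    projectId=project_id,
    body=job_body,
    media_body=MediaIoBaseUpload(csv_buf, mimetype='application/octet-stream'),
).execute()

# Poll until the load job is DONE; the rows are then committed as a unit
# (or the job reports an error), unlike the streaming buffer.
while True:
    status = bq.jobs().get(projectId=project_id,
                           jobId=job['jobReference']['jobId']).execute()['status']
    if status['state'] == 'DONE':
        if status.get('errorResult'):
            raise RuntimeError(status['errorResult'])
        break
    time.sleep(1)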

However, atomicity seems important in the case where we immediately run a BigQuery query against the freshly uploaded to_gbq() results, and that seems worth a tradeoff in load time.

At the least we could include an option. Thoughts?

(Not sure how to label this; IO:google?)

Output of pd.show_versions()

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 9.0.0
setuptools: 25.2.0
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2015.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.4
pymysql: None
psycopg2: 2.6 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

Comment From: jorisvandenbossche

cc @parthea

Comment From: nicku33

Hey, just want to pick up this discussion again. We have had some more confirmed cases of streaming-uploaded data not appearing in a timely manner. However, the delay is not deterministic; it depends on Google. Speaking to Jordan Tigrani, he said the only guarantee they provide is 90 minutes, though typical cases are obviously much shorter.

I think most users would expect to_gbq() to be transactional. The ability to query directly after to_gbq() has executed is probably a common use case. For example, one might use pandas to build a scoring table, push it to BigQuery, and then run a query joining it against a huge dataset that is too large to pull into local memory.

At work we've written a helper function that uses the non-streaming load, and most people are avoiding to_gbq() because of this issue.

I see two tacks: a) push Google to give better bounds on streaming insertion, or b) rewrite it as a BigQuery load job.

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated config. A rough sketch of that variant follows.
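To make the extra config concrete, here's roughly what the bucket-staging variant might look like with the same discovery client; the bucket name and object path are made up, and the staging step is what requires the additional write permission:

import io

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseUpload
from oauth2client.client import GoogleCredentials

project_id = 'sample-gcp-project'
bucket, obj = 'sample-staging-bucket', 'staging/tmp_tmp.csv'   # hypothetical bucket/object

credentials = GoogleCredentials.get_application_default()
storage = build('storage', 'v1', credentials=credentials)
bq = build('bigquery', 'v2', credentials=credentials)

# Stage the CSV in the bucket first (the step that needs bucket write access).
csv_buf = io.BytesIO(b'1,1\n2,2\n')   # stand-in for df.to_csv(...) output
storage.objects().insert(
    bucket=bucket,
    name=obj,
    media_body=MediaIoBaseUpload(csv_buf, mimetype='text/csv'),
).execute()

# Then point a load job at the gs:// URI and poll it as in the earlier sketch.
bq.jobs().insert(projectId=project_id, body={
    'configuration': {'load': {
        'sourceUris': ['gs://%s/%s' % (bucket, obj)],
        'destinationTable': {'projectId': project_id,
                             'datasetId': 'tmp', 'tableId': 'tmp'},
        'sourceFormat': 'CSV',
        'writeDisposition': 'WRITE_TRUNCATE',
        'schema': {'fields': [{'name': 'a', 'type': 'INTEGER'},
                              {'name': 'b', 'type': 'INTEGER'}]},
    }},
}).execute()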

Thoughts?

Comment From: nicku33

Here's code to see the behaviour:

import random
import time

import pandas as pd
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

project_id = 'sample-gcp-project'
credentials = GoogleCredentials.get_application_default()
bigquery_service = build('bigquery', 'v2', credentials=credentials)  # not used below; handy if you want to poke the API directly
bq_tablename = 'tmp.tmp'

# Replace the table ad nauseam and check whether the streamed rows are visible yet.
k = 0
while True:

    n = random.randint(1000, 10000)
    df = pd.DataFrame({'a': xrange(n), 'b': xrange(n)})
    pd.io.gbq.to_gbq(df, bq_tablename, project_id, if_exists='replace')

    # Did it work? Keep counting rows until the table matches the DataFrame.
    while True:
        check_query = 'SELECT COUNT(1) as ct FROM ' + bq_tablename
        df_check = pd.io.gbq.read_gbq(check_query, project_id)
        n_check = df_check['ct'].values[0]
        if n_check == len(df):
            print 'Success!'
            break
        else:
            print 'Crap, table length is {0} not {1}.'.format(n_check, len(df))
        k += 1
        print '(k = %s) Sleeping 10 seconds before retrying the check or uploading again.' % (k)
        time.sleep(10)

Comment From: max-sixty

Agree in principle that a batch upload is a much better fit for this than a streaming load.

However, the latter option requires the user to have write access to a Google Cloud Storage bucket, which is a more complicated config.

Where do you see that? While you could load into Google Cloud Storage and then from there into BigQuery, you can also do a single POST request: https://cloud.google.com/bigquery/loading-data-post-request
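For reference, a hand-rolled sketch of that single POST, assuming the multipart upload endpoint described in the linked docs, the requests library, and application-default credentials; the job config, dataset/table names, and CSV payload are placeholders, not pandas-gbq code:

import json

import requests
from oauth2client.client import GoogleCredentials

project_id = 'sample-gcp-project'                   # placeholder, as above
job_config = {
    'configuration': {
        'load': {
            'destinationTable': {'projectId': project_id,
                                 'datasetId': 'tmp', 'tableId': 'tmp'},
            'sourceFormat': 'CSV',
            'writeDisposition': 'WRITE_TRUNCATE',
            'schema': {'fields': [{'name': 'a', 'type': 'INTEGER'},
                                  {'name': 'b', 'type': 'INTEGER'}]},
        }
    }
}
csv_bytes = b'1,1\n2,2\n'                           # stand-in for df.to_csv(...) output

credentials = GoogleCredentials.get_application_default()
token = credentials.get_access_token().access_token

# Build a multipart/related body: part 1 is the job config, part 2 is the data.
boundary = 'xxx_gbq_boundary_xxx'
body = ('--%s\r\nContent-Type: application/json; charset=UTF-8\r\n\r\n%s\r\n'
        '--%s\r\nContent-Type: application/octet-stream\r\n\r\n'
        % (boundary, json.dumps(job_config), boundary)).encode('utf-8')
body += csv_bytes + ('\r\n--%s--\r\n' % boundary).encode('utf-8')

url = ('https://www.googleapis.com/upload/bigquery/v2/projects/%s/jobs'
       '?uploadType=multipart' % project_id)
resp = requests.post(url, data=body, headers={
    'Authorization': 'Bearer ' + token,
    'Content-Type': 'multipart/related; boundary=' + boundary,
})
resp.raise_for_status()
print(resp.json()['jobReference'])                  # job reference to poll until DONE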

Comment From: nicku33

Ah, you're right; I misunderstood. I will look into what their POST size limits are, to be safe.

If we agree that batch is better, does anyone object if I try to write this patch myself over the holidays? I've wanted to contribute to pandas in some way for years, and this would be a nice, easy one to get my feet wet on.

Comment From: jreback

moved to pydata/pandas-gbq#7