Please see the details below and help.
I have a Hive table with 351,837 records (110 MB) that I am reading with Python and writing into SQL Server.
In this process, reading the data from Hive into a pandas DataFrame is taking a long time: loading all 351k records takes 90 minutes.
To improve this, I switched to the following approach: read 10k rows at a time from Hive and write them into SQL Server. But reading a single 10k-row chunk from Hive and assigning it to a DataFrame alone takes 4-5 minutes.
Please see the code below.
    import datetime
    import urllib
    import pandas
    import pyodbc
    import sqlalchemy

    def execute_hadoop_export():
        """
        This will run the steps required for a Hadoop Export.
        Return value is a boolean for success/fail.
        """
        try:
            hql = 'select * from db.table'
            # Open Hive ODBC connection
            src_conn = pyodbc.connect("DSN=****", autocommit=True)
            cursor = src_conn.cursor()
            # Using SQLAlchemy to build the connection string and leverage
            # DataFrame.to_sql to write to SQL Server...
            sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=Xyz;DATABASE=Db2;UID=ee;PWD=*****')
            sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
            engine = sqlalchemy.create_engine(sql_conn_str)
            # Read the source table in chunks.
            vstart = datetime.datetime.now()
            for df in pandas.read_sql(hql, src_conn, chunksize=10000):
                vfinish = datetime.datetime.now()
                # Remove table alias from columns (happens by default in Hive
                # due to ODBC settings -- "Use Native Query" perhaps?)
                df.rename(columns=lambda x: remove_table_alias(x), inplace=True)
                print 'Finished 10k rows reading from hive and it took', (vfinish - vstart).seconds / 60.0, 'minutes'
                df.to_sql(name='table', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)
                print 'Finished 10k rows writing into sql server and it took', (datetime.datetime.now() - vfinish).seconds / 60.0, 'minutes'
                vstart = datetime.datetime.now()
            cursor.close()
        except Exception as e:
            print str(e)
Please find the output screenshots below.
Can you kindly suggest the fastest way to read Hive table data in Python?
Note: I have also tried the Sqoop export option, but my Hive table is already in bucketed format.
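For what it's worth, I am also considering pulling rows directly from the pyodbc cursor with fetchmany and building each DataFrame manually, to check whether pandas.read_sql itself adds overhead on the Hive side. A minimal sketch (the fetch_chunks helper name is my own, and it assumes an already-open cursor on which the query has been executed):

```python
import pandas


def fetch_chunks(cursor, size=10000):
    """Yield DataFrames of up to `size` rows from an open DBAPI cursor.

    Assumes the query has already been executed on the cursor, so
    cursor.description carries the column names.
    """
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        # Build the DataFrame directly from the fetched row tuples.
        yield pandas.DataFrame.from_records(rows, columns=columns)
```

If iterating this way is noticeably faster than pandas.read_sql with chunksize, the bottleneck would be in pandas' chunked reader rather than the ODBC fetch itself.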