Performance issue while reading data from hive using python

Please see the details below and help me out.

I have a table in Hive with 351,837 records (110 MB in size), and I am reading this table using Python and writing it into SQL Server.

In this process, reading the data from Hive into a pandas DataFrame is taking a long time. Loading all 351k records takes 90 minutes.

To improve this, I switched to reading 10k rows at a time from Hive and writing them into SQL Server. But reading just 10k rows from Hive and assigning them to a DataFrame alone takes 4-5 minutes.

Please see the code below.

import datetime
import urllib

import pandas
import pyodbc
import sqlalchemy


def execute_hadoop_export():
    """
    This will run the steps required for a Hadoop export.
    Return value is a boolean for success/fail.
    """
    try:
        hql = 'select * from db.table'
        # Open Hive ODBC connection
        src_conn = pyodbc.connect("DSN=****", autocommit=True)
        cursor = src_conn.cursor()
        # tgt_conn = pyodbc.connect(target_connection)

        # Use SQLAlchemy so DataFrame.to_sql can write to SQL Server.
        sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=Xyz;DATABASE=Db2;UID=ee;PWD=*****')
        sql_conn_str = "mssql+pyodbc:///?odbc_connect={0}".format(sql_conn_url)
        engine = sqlalchemy.create_engine(sql_conn_str)

        # Read the source table in 10k-row chunks.
        vstart = datetime.datetime.now()
        for df in pandas.read_sql(hql, src_conn, chunksize=10000):
            vfinish = datetime.datetime.now()
            # Remove the table alias from the columns (added by default by the
            # Hive ODBC driver; the "Use Native Query" option may avoid this).
            df.rename(columns=lambda x: remove_table_alias(x), inplace=True)
            print 'Finished reading 10k rows from Hive and it took', (vfinish - vstart).seconds / 60.0, 'minutes'
            # Write the chunk to the target table in SQL Server.
            df.to_sql(name='table', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)
            print 'Finished writing 10k rows into SQL Server and it took', (datetime.datetime.now() - vfinish).seconds / 60.0, 'minutes'
            vstart = datetime.datetime.now()
        cursor.close()
        return True
    except Exception, e:
        print str(e)
        return False

Please see the linked images (Result) for the output timings.

Can you kindly suggest the fastest way to read Hive table data into Python?

Note: I have also tried the Sqoop export option, but my Hive table is already in bucketed format.
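One thing that may be worth trying is to bypass the ODBC layer and read over Thrift with PyHive. The sketch below is only illustrative; the host, port, user, and database names are placeholders, not values from the setup above.

import pandas
from pyhive import hive  # assumes PyHive (plus thrift/sasl) is installed

# Placeholder connection details; HiveServer2 commonly listens on port 10000.
conn = hive.Connection(host='hive-server-host', port=10000,
                       username='user', database='db')

# Stream the table in larger chunks and concatenate at the end.
frames = []
for chunk in pandas.read_sql('SELECT * FROM db.table', conn, chunksize=50000):
    frames.append(chunk)
df = pandas.concat(frames, ignore_index=True)
print len(df), 'rows read'

Whether this is faster depends on the cluster, but it removes the ODBC driver from the read path and lets the read chunk size be tuned independently of the write side.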

Unable to connect Hive to my local Windows machine. Getting connection error: java.sql.SQLException: Could not open client transport with JDBC Uri

I am trying to make a connection between HiveServer2 and my local Windows machine with Python. I have a connection string and a keystore file. I am using the Jaydebeapi Python module to solve this issue. The various .jar files I am using are: 1) HiveJDBC4.jar
2) hive_metastore.jar
3) hive-service-0.13.1.jar
4) libfb303-0.9.0.jar
5) libthrift-0.9.0.jar
6) log4j-1.2.14.jar
7) ql.jar
8) slf4j-api-1.5.8.jar
9) slf4j-log4j12-1.5.8.jar
10) TCLIServiceClient.jar
11) httpclient-4.3.3.jar
12) httpcore-4.3.jar
13) guava-16.0.1.jar
14) hadoop-common-2.2.0.jar
15) hive-common-0.10.0.jar

My connection string is:

jdbc:hive2://example@domain.com:port/;   ssl=true;   sslTrustStore=FileKey.jks;   trustStorePassword=password;   transportMode=http;   httpPath=gateway/default/hive    

I have tried other modules, but for the given problem and inputs I found the Jaydebeapi approach to be the most suitable here. I have written the following Python code:

import jaydebeapi

Jars = ['C:/Cloudera_HiveJDBC/HiveJDBC4.jar',
        'C:/Cloudera_HiveJDBC/hive_metastore.jar',
        'C:/Cloudera_HiveJDBC/hive-service-0.13.1.jar',
        'C:/Cloudera_HiveJDBC/libfb303-0.9.0.jar',
        'C:/Cloudera_HiveJDBC/libthrift-0.9.0.jar',
        'C:/Cloudera_HiveJDBC/log4j-1.2.14.jar',
        'C:/Cloudera_HiveJDBC/ql.jar',
        'C:/Cloudera_HiveJDBC/slf4j-api-1.5.8.jar',
        'C:/Cloudera_HiveJDBC/slf4j-log4j12-1.5.8.jar',
        'C:/Cloudera_HiveJDBC/TCLIServiceClient.jar',
        'C:/Cloudera_HiveJDBC/httpclient-4.3.3.jar',
        'C:/Cloudera_HiveJDBC/httpcore-4.3.jar',
        'C:/Cloudera_HiveJDBC/guava-16.0.1.jar',
        'C:/Cloudera_HiveJDBC/hadoop-common-2.2.0.jar',
        'C:/Cloudera_HiveJDBC/hive-common-0.10.0.jar']

conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver',
                               'jdbc:hive2://example@domain.com:port/',
                               {'ssl': "true",
                                'sslTrustStore': "Filekey.jks",
                                'trustStorePassword': "password",
                                'transportMode': "http",
                                'httpPath': "gateway/default/hive"},
                               jars=Jars)
cursor = conn_hive.cursor()

But I am getting an error:

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://example@domain.com:port/: null 

Is there any problem with the code, with the approach, or with the JAR files I have added? Are there any other alternatives I could use to get better results?
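As a point of comparison only: the Apache Hive JDBC driver usually expects session settings such as ssl, transportMode, and httpPath to be part of the JDBC URL itself rather than passed as separate driver properties. A minimal sketch of that variant, with placeholder host, port, keystore path, and credentials:

import jaydebeapi

# Placeholder values throughout; reuse the Jars list from the code above.
jdbc_url = ('jdbc:hive2://example.domain.com:443/default;'
            'ssl=true;'
            'sslTrustStore=C:/Cloudera_HiveJDBC/FileKey.jks;'
            'trustStorePassword=password;'
            'transportMode=http;'
            'httpPath=gateway/default/hive')

conn_hive = jaydebeapi.connect('org.apache.hive.jdbc.HiveDriver',
                               jdbc_url,
                               ['username', 'password'],  # gateway credentials
                               jars=Jars)
cursor = conn_hive.cursor()
cursor.execute('show databases')
print cursor.fetchall()

Keeping all session parameters in the URL also makes it easy to test the same string in Beeline or another SQL client before wiring it into Python.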

How to write Hive queries for the following problems

I have a table with 10 columns:

ColumnNo    Name            Example          DataType
Column1     athlete_name    Michael Phelps   STRING
Column2     age             23               INT
Column3     country         United States    STRING
Column4     year            2008             INT
Column5     closing_date    8/24/2008        STRING
Column6     sport           Swimming         STRING
Column7     gold_medals     8                INT
Column8     silver_medals   0                INT
Column9     bronze_medals   0                INT
Column10    total_medals    8                INT

I have a few queries to write (a sketch for the first one follows the list):

1) Find the total number of medals won by each country in swimming.
2) Find the total number of medals won by India in each year.
3) Find the total number of medals won by each country, and also display the name of the country.
4) Find the total number of gold medals won by each country.
5) Find the countries that won medals in shooting, for each year.
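For illustration only, here is how the first query might be run from Python over the same kind of ODBC connection used elsewhere in this post; the table name olympic and the DSN are assumptions, not taken from the original question.

import pandas
import pyodbc

# Hypothetical table name and DSN; substitute the real ones.
hql = """
SELECT country, SUM(total_medals) AS swimming_medals
FROM olympic
WHERE lower(sport) = 'swimming'
GROUP BY country
"""

conn = pyodbc.connect("DSN=****", autocommit=True)
print pandas.read_sql(hql, conn)

The remaining queries follow the same pattern: a WHERE filter (country = 'India', sport = 'Shooting') plus a GROUP BY on the column or columns being reported, such as year or country.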

Apache Drill – Hive Storage Plugin

I wanted to connect to Hive via drill and I am able to successfully do it.

However, in certain cases I have a problem with the "fs.default.name" parameter, where we specify the NameNode IP and port. The problem is this: say I have a 10-node cluster configured with a standby NameNode in case the primary NameNode fails or crashes. When failover happens, the active NameNode IP changes (it is not the same as the primary NameNode IP). So how can I avoid hard-coding the IP address? Is there some sort of namespace concept? Any help is much appreciated.

{     "type": "hive",     "enabled": false,     "configProps": {     "hive.metastore.uris": "thrift://hdfs41:9083",     "hive.metastore.sasl.enabled": "false",     ** 

“fs.default.name”: “hdfs://10.10.10.41/”

** } }
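HDFS high availability addresses exactly this by letting clients refer to a logical nameservice instead of a NameNode address. As a rough sketch only, with a nameservice name and NameNode addresses that are placeholders (the real values come from the cluster's hdfs-site.xml, and Drill may also be able to read them from an hdfs-site.xml on its classpath):

{
    "type": "hive",
    "enabled": false,
    "configProps": {
        "hive.metastore.uris": "thrift://hdfs41:9083",
        "hive.metastore.sasl.enabled": "false",
        "fs.default.name": "hdfs://nameservice1/",
        "dfs.nameservices": "nameservice1",
        "dfs.ha.namenodes.nameservice1": "nn1,nn2",
        "dfs.namenode.rpc-address.nameservice1.nn1": "10.10.10.41:8020",
        "dfs.namenode.rpc-address.nameservice1.nn2": "10.10.10.42:8020",
        "dfs.client.failover.proxy.provider.nameservice1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    }
}

With this, the client resolves whichever NameNode is currently active, so no single IP is hard-coded into the plugin.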

Replacing cached domain credentials in SECURITY hive

Windows stores the (NTLM) hashes of local users’ passwords in the SAM hive. By booting from a live system (for example), one can not only extract those hashes for offline cracking, but also simply replace the hash with that of a known password (for example, chntpw in Kali Linux is a tool that excels at this task). Similarly, one can turn a normal user into an admin user and enable/disable users. So far, so good.

In a similar, yet different fashion, the password hashes of domain accounts of users that have previously logged in on the computer are stored in the SECURITY hive so that a user can re-login even when they are off the network. Tools like cachedump can extract those hashes for offline cracking. However, due to the different hashing algorithm used, most (all?) tools that can replace hashes in SAM cannot do the same in SECURITY.

Now my question: Is it possible to replace the cached password hash of a domain user with that of a known password, in order to then reboot the system and log in with the known password (bonus points for answers specific to Windows 10, in case there are differences to previous versions)? This of course assumes that the device is off the network so Windows cannot check the password online with the Domain Controller.

Load data from pipe delimited flat file to hive table ignoring first column, first row and last row

I have a set of files in the below format

For example-

H||
D|12|canada|
D|432|
T|4

I want to load only the data part into a Hive table, ignoring the header row, the trailer row, and also the first column containing 'D'.

I know we can use table properties to skip the header part; is there anything similar to remove the trailer and the first column as well?
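A sketch of one common pattern, under the assumption that the file lands in a staging location and is then inserted into the real table; the DSN, table names, column names, and path here are made up for illustration:

import pyodbc

# Hypothetical DSN, path, and table/column names.
conn = pyodbc.connect("DSN=****", autocommit=True)
cur = conn.cursor()

# The staging table keeps the record-type marker ('H', 'D', 'T') as its first
# column and skips the header line via a table property.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging_feed (
        record_type STRING,
        col1        STRING,
        col2        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/data/feed/'
    TBLPROPERTIES ('skip.header.line.count'='1')
""")

# Keep only data rows, dropping the trailer row and the marker column
# in the same SELECT.
cur.execute("""
    INSERT INTO TABLE target_table
    SELECT col1, col2
    FROM staging_feed
    WHERE record_type = 'D'
""")

With skip.header.line.count handling the header, the WHERE record_type = 'D' filter removes the trailer row, and leaving record_type out of the SELECT list drops the first column.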

Why am I not able to select a Hive Metastore DB in a different subscription when creating an HDInsight cluster?

I am trying to create an HDInsight (Hadoop + Linux) cluster from the Azure portal. While trying to specify an existing external Hive Metastore, the SQL Server drop-down only shows the Azure SQL servers that are in the same subscription from which I am creating the cluster. My Metastore SQL server/database exists in a different subscription. The interesting part is that I am able to do this when I use PowerShell. So why does the portal restrict it?

Apache Hive configuration in GCP – clueless?

I have tried to configure an Apache Hive cluster in GCP by following this Google Cloud Platform guide:

https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc

I have set up a project and have a whole year's worth of credits available. The script on that page uses the Google Cloud Shell to configure the cluster instead of using the GCP UI.

Herein lies the problem: I carry out all the steps above, and just as I am about to create the cluster using the following command, it fails.

gcloud dataproc clusters create hive-cluster \
    --scopes sql-admin \
    --image-version 1.3 \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
    --metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"

The error I get is “Insufficient ‘CPUS’ quota. Requested 12.0, available 8.0.”

Now here is the problem: if I had used the GCP UI (Dataproc) to create the cluster, I would have been able to configure the CPUs. But that option has its own limitations with respect to Hive.

So I had to launch the cluster programmatically via the shell. Now I don't know how to fix this, unless I physically open every shell script available in the Google bucket and find where these values have been set.

Any help anyone? Maybe I should be looking at this differently…
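For reference, the 12 requested vCPUs are consistent with one master and two workers at 4 vCPUs each, which suggests the default machine type is being used. The machine sizes can be set on the same gcloud command rather than inside the initialization scripts; the flags below are standard gcloud options, but the specific machine types are only a suggestion to fit an 8-vCPU quota:

gcloud dataproc clusters create hive-cluster \
    --scopes sql-admin \
    --image-version 1.3 \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2 \
    --initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
    --properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
    --metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"

Alternatively, requesting a CPU quota increase for the region through the IAM & admin Quotas page avoids having to shrink the machines at all.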