hive -e and -f: Executing Hive SQL Statements from a File via Beeline

I. Beeline in Detail

Beeline is slated to replace the Hive CLI as Hive's client tool, and later releases will deprecate the Hive CLI. Beeline, introduced in Hive 0.11, is a JDBC command-line client based on the SQLLine CLI.

Beeline supports an embedded mode and a remote mode. In embedded mode it runs an embedded Hive (similar to the Hive CLI); in remote mode it connects over Thrift to a standalone HiveServer2 process. Starting with Hive 0.14, when Beeline works against HiveServer2 it also prints HiveServer2's log messages to STDERR.
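
For example (reusing the host and port from the sessions below), a remote connection passes a full JDBC URL, while an embedded session uses a bare jdbc:hive2:// URL:

# remote mode: connect over Thrift to a standalone HiveServer2
beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive

# embedded mode: no host/port, beeline starts an embedded Hive
beeline -u jdbc:hive2://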

1. Common beeline parameters

Usage: java org.apache.hive.cli.beeline.BeeLine 
   -u <database url>               the JDBC URL to connect to
   -n <username>                   the username to connect as
   -p <password>                   the password to connect as
   -d <driver class>               the driver class to use
   -i <init file>                  script file for initialization
   -e <query>                      query that should be executed
   -f <exec file>                  script file that should be executed
   -w (or) --password-file <password file>  the password file to read password from
   --hiveconf property=value       Use value for given property
   --hivevar name=value            hive variable name and value
                                   This is Hive specific settings in which variables
                                   can be set at session level and referenced in Hive
                                   commands or queries.
   --color=[true/false]            control whether color is used for display
   --showHeader=[true/false]       show column names in query results
   --headerInterval=ROWS;          the interval between which heades are displayed
   --fastConnect=[true/false]      skip building table/column list for tab-completion
   --autoCommit=[true/false]       enable/disable automatic transaction commit
   --verbose=[true/false]          show verbose error messages and debug info
   --showWarnings=[true/false]     display connection warnings
   --showNestedErrs=[true/false]   display nested errors
   --numberFormat=[pattern]        format numbers using DecimalFormat pattern
   --force=[true/false]            continue running script even after errors
   --maxWidth=MAXWIDTH             the maximum width of the terminal
   --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
   --silent=[true/false]           be more silent
   --autosave=[true/false]         automatically save preferences
   --outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv]  format mode for result display
                                   Note that csv, and tsv are deprecated - use csv2, tsv2 instead
   --truncateTable=[true/false]    truncate table column when it exceeds length
   --delimiterForDSV=DELIMITER     specify the delimiter for delimiter-separated values output format (default: |)
   --isolation=LEVEL               set the transaction isolation level
   --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
   --addlocaldriverjar=DRIVERJARNAME Add driver jar file in the beeline client side
   --addlocaldrivername=DRIVERNAME Add drvier name needs to be supported in the beeline client side
   --help                          display this message
Beeline version 2.3.4.spark2 by Apache Hive
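
As a quick sketch of combining several of these flags (the variable name db is just an illustration), --hivevar defines a session variable that queries can reference as ${db}, and --outputformat switches the result format:

beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive \
    --hivevar db=ds_db \
    --outputformat=csv2 --showHeader=true \
    -e 'select * from ${db}.student'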


[root@homaybd03 hive]# cd /usr/hdp/3.1.0.0-78/hive/
[root@homaybd03 hive]# bin/beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive
[root@homaybd03 hive]# /usr/hdp/3.1.0.0-78/hive/bin/beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive -f simple.sql
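
The contents of simple.sql are not shown here; judging from the session log below, it presumably holds a single DROP TABLE statement, something like:

cat > simple.sql <<'EOF'
drop table if exists ds_db.yuce_ma_moxing_scr;
EOF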

Output:

[hive@homaybd03 ~]$ /usr/hdp/3.1.0.0-78/hive/bin/beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive -f simple.sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://192.168.1.124:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.1.124:10000> drop table if exists ds_db.yuce_ma_moxing_scr; 
INFO  : Compiling command(queryId=hive_20221005220907_f73ef1d2-b2d8-456a-9bd7-b29449312fc9): drop table if exists ds_db.yuce_ma_moxing_scr
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20221005220907_f73ef1d2-b2d8-456a-9bd7-b29449312fc9); Time taken: 0.047 seconds
INFO  : Executing command(queryId=hive_20221005220907_f73ef1d2-b2d8-456a-9bd7-b29449312fc9): drop table if exists ds_db.yuce_ma_moxing_scr;
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hive_20221005220907_f73ef1d2-b2d8-456a-9bd7-b29449312fc9); Time taken: 0.38 seconds
INFO  : OK
No rows affected (0.511 seconds)
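
beeline exits with a non-zero status when the script fails, so a shell wrapper can test $? — a minimal sketch:

/usr/hdp/3.1.0.0-78/hive/bin/beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive -f simple.sql
if [ $? -ne 0 ]; then
    echo "simple.sql failed" >&2
    exit 1
fi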

II. Examples

  1. Run SQL statements, or the contents of a SQL file, directly from the server command line.
    Sometimes you want to run Hive SQL straight from the shell without entering the interactive hive prompt; hive -e and hive -f make this possible.

  2. hive -e runs an inline SQL string; hive -f runs the statements in a script file (see the sketch after this list).

  3. beeline -f, beeline usage:

    beeline -u "jdbc:hive2://localhost:10000" -n yourname -p password -f hivescript.sql

Query example:

[root@homaybd04 ~]# su hive
[hive@homaybd04 root]$ hive -e "select * from ds_db.student;select * from ds_db.student limit 1"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
22/10/05 21:26:53 [main]: INFO jdbc.HiveConnection: Connected to homaybd04:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20221005212653_98b57836-abee-4a6d-b5d2-fada18a27bfd): select * from ds_db.student
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student.id, type:int, comment:null), FieldSchema(name:student.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221005212653_98b57836-abee-4a6d-b5d2-fada18a27bfd); Time taken: 0.161 seconds
INFO  : Executing command(queryId=hive_20221005212653_98b57836-abee-4a6d-b5d2-fada18a27bfd): select * from ds_db.student
INFO  : Completed executing command(queryId=hive_20221005212653_98b57836-abee-4a6d-b5d2-fada18a27bfd); Time taken: 0.006 seconds
INFO  : OK
+-------------+---------------+
| student.id  | student.name  |
+-------------+---------------+
| 1           | zhangsan      |
| 2           | lisi          |
+-------------+---------------+
2 rows selected (0.655 seconds)
INFO  : Compiling command(queryId=hive_20221005212653_dc1edb81-2e3e-453d-8e93-a009d9dfc31e): select * from ds_db.student limit 1
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student.id, type:int, comment:null), FieldSchema(name:student.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221005212653_dc1edb81-2e3e-453d-8e93-a009d9dfc31e); Time taken: 0.162 seconds
INFO  : Executing command(queryId=hive_20221005212653_dc1edb81-2e3e-453d-8e93-a009d9dfc31e): select * from ds_db.student limit 1
INFO  : Completed executing command(queryId=hive_20221005212653_dc1edb81-2e3e-453d-8e93-a009d9dfc31e); Time taken: 0.005 seconds
INFO  : OK
+-------------+---------------+
| student.id  | student.name  |
+-------------+---------------+
| 1           | zhangsan      |
+-------------+---------------+
1 row selected (0.247 seconds)
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
Closing: 0: jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
[hive@homaybd04 root]$ 
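
Because the INFO log lines go to STDERR (see Section I) while the result tables go to STDOUT, plain shell redirection separates results from logs, e.g.:

hive -e "select * from ds_db.student" > student.out 2> hive.log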

When a script contains multiple SQL statements, they are in fact executed one at a time, split on semicolons.

To see this, deliberately misspell the table name student as student4 and watch the result: execution is serial, and once one statement fails, the remaining statements are not executed:

[hive@homaybd04 ~]$ hive -e "select * from ds_db.student4;select * from ds_db.student limit 1"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
22/10/05 22:02:12 [main]: INFO jdbc.HiveConnection: Connected to homaybd04:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'student4' (state=42S02,code=10001)
Closing: 0: jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
[hive@homaybd04 ~]$ hive -e "select * from ds_db.student;select * from ds_db.student4 limit 1"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
22/10/05 22:02:40 [main]: INFO jdbc.HiveConnection: Connected to homaybd04:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20221005220240_e3666c67-231a-4c35-baed-4bf4620aa9a9): select * from ds_db.student
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:student.id, type:int, comment:null), FieldSchema(name:student.name, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221005220240_e3666c67-231a-4c35-baed-4bf4620aa9a9); Time taken: 0.122 seconds
INFO  : Executing command(queryId=hive_20221005220240_e3666c67-231a-4c35-baed-4bf4620aa9a9): select * from ds_db.student
INFO  : Completed executing command(queryId=hive_20221005220240_e3666c67-231a-4c35-baed-4bf4620aa9a9); Time taken: 0.004 seconds
INFO  : OK
+-------------+---------------+
| student.id  | student.name  |
+-------------+---------------+
| 1           | zhangsan      |
| 2           | lisi          |
+-------------+---------------+
2 rows selected (0.337 seconds)
Error: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'student4' (state=42S02,code=10001)
Closing: 0: jdbc:hive2://homaybd04:2181,homaybd05:2181,homaybd03:2181/default;password=hive;serviceDiscoveryMode=zooKeeper;user=hive;zooKeeperNamespace=hiveserver2
[hive@homaybd04 ~]$ 
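
If the remaining statements should keep running after a failure, the --force flag from the parameter list above ("continue running script even after errors") reverses this behavior; a sketch against the same script:

beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive --force=true -f simple.sql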

Check the HiveServer2 service on the backend: http://192.168.1.124:10002/


The SQL script to execute can live on the local filesystem or on HDFS:

# local filesystem
hive -f /usr/file/simple.sql

# HDFS
hive -f hdfs://hadoop001:8020/tmp/simple.sql

Here simple.sql contains:

select * from emp;
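
To stage the script on HDFS first (paths mirror the example above):

hdfs dfs -put simple.sql /tmp/simple.sql
hive -f hdfs://hadoop001:8020/tmp/simple.sql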

III. Inserting Data

hive_test.py

from pyhive import hive
import sqlalchemy as sa
from sqlalchemy import create_engine
import pandas as pd
import time

def hive_cli(username, host, port, database):
    # Plain DB-API connection via PyHive (no SQLAlchemy engine)
    conn = hive.connect(host=host, port=port, username=username, database=database)
    return conn

def hive_engine(username, password, host, port, database):
    """
    SQLAlchemy engine over PyHive, authenticating via LDAP.
    See https://github.com/pandas-dev/pandas/issues/15276

    :param username:
    :param password:
    :param host:
    :param port:
    :param database:
    :return:
    """
    engine = create_engine('hive://{}:{}@{}:{}/{}?auth=LDAP'.format(username, password, host, port, database))
    return engine

def hive_engine_old(username, password, host, port, database):
    """
    Same as hive_engine, but authenticating via a CUSTOM provider.
    See https://github.com/pandas-dev/pandas/issues/15276

    :param username:
    :param password:
    :param host:
    :param port:
    :param database:
    :return:
    """
    # engine = create_engine('hive://{}:{}@{}:{}/{}?auth=LDAP'.format(username, password, host, port, database))
    engine = create_engine('hive://{}:{}@{}:{}/{}?auth=CUSTOM'.format(username, password, host, port, database))
    return engine

def get_table_data(engine, table_name, offset=1, limit=10, order_by=None) -> pd.DataFrame:
    # Reflect the table from the database, then read a page of rows into pandas
    table = sa.Table(table_name, sa.MetaData(), autoload=True, autoload_with=engine)
    if order_by is not None and order_by != "":
        data = pd.read_sql(table.select().order_by(order_by).offset(offset).limit(limit), engine)
    else:
        data = pd.read_sql(table.select().offset(offset).limit(limit), engine)
    return data

def save_data(engine, table_name, data, if_exists="replace"):
    # Write the DataFrame in multi-row INSERT batches; "replace" drops and recreates the table
    result = data.to_sql(table_name, con=engine, index=False, if_exists=if_exists, method="multi")
    return result

def timeit(f):
    # Simple decorator that reports how long the wrapped function took
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = f(*args, **kwargs)
        end_time = time.time()
        print("%s took %.2f s" % (f.__name__, end_time - start_time))
        return res
    return wrapper

@timeit
def main():
    df = pd.read_csv("./train.csv")
    # print(df.head(1))
    # print("printing data")

    # Build a large DataFrame by appending the same CSV 125 times
    total_list = []

    for i in range(0, 125):
        print(f"appending chunk {i}")
        total_list.append(df)

    # print(total_list)

    username = "hive"
    password = "hive"
    host = '192.168.1.124'
    port = 10000
    database = 'test_db'

    print("connecting to the database")
    engine = hive_engine(username, password, host, port, database)
    print("database connected")

    print("number of chunks:", len(total_list))

    res_df = pd.concat(total_list)
    print("total rows:", len(res_df))

    res = save_data(engine, "fanqi_train_tbl", res_df, if_exists="replace")
    print(res)

    # print(df)

if __name__ == '__main__':
    main()
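
After the script finishes, a quick way to verify the insert is to count the rows from beeline (connection details as in main()):

beeline -u jdbc:hive2://192.168.1.124:10000 -n hive -p hive \
    -e "select count(*) from test_db.fanqi_train_tbl"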

Check the execution progress in the web UIs:

HiveServer2 web UI: http://192.168.1.124:10002/

Hadoop cluster (YARN) web UI: http://192.168.1.121:8088/cluster



Related articles:
Hive Beeline in Detail
Official docs | HiveServer2 Clients
Replace Hive CLI with Beeline on a cluster with Sentry
Basic usage of the Hive CLI and Beeline command lines

Those who act often succeed; those who keep going often arrive.