Loading Redshift Data in Spark: A Summary of Problems

Original

码一八

2020-06-21 07:53

1. java.sql.SQLException: No suitable driver

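This error occurs because connecting to Redshift requires a JDBC driver, and the program cannot find a usable one at runtime. AWS provides several versions of the Redshift JDBC driver for download.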
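One way to rule out driver-resolution problems is to name the driver class explicitly in the JDBC options. The following is a minimal sketch under assumptions: the helper name `redshift_jdbc_options` and every connection value are hypothetical, and the class name `com.amazon.redshift.jdbc42.Driver` matches the 1.x JDBC 4.2 driver jars and may differ for other driver versions. The jar itself still has to be on the Spark classpath.

```python
def redshift_jdbc_options(host, port, database, table, user, password,
                          driver_class="com.amazon.redshift.jdbc42.Driver"):
    """Build an options dict for Spark's generic JDBC reader.

    driver_class is an assumption; check the class name shipped in your jar.
    """
    return {
        "url": f"jdbc:redshift://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver_class,  # tells Spark which JDBC driver class to load
    }


# Placeholder values for illustration only
opts = redshift_jdbc_options("example-host", 5439, "dev",
                             "table_name", "username", "password")
```

With a live session, the dict could be passed as `spark.read.format("jdbc").options(**opts).load()`.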

2. java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/model/Record

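After several attempts I found that using the AWS-provided driver directly can connect to Redshift and print the table schema, but it cannot load data: as soon as data is loaded, this odd error appears. I could not work out why the schema prints fine while loading fails. After some searching I found a wrapper library, spark-redshift, hosted on GitHub.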

3. java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

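After configuring things according to the documentation of the GitHub library from item 2, you may hit this error. spark-redshift uses S3, so an access key and secret have to be configured before it works. The documentation offers several ways to do this (i, ii, and iii); at first I chose the third, writing the credentials directly into the URI.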
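The URI-based approach can be sketched as below. This is an illustration, not the library's own code: the helper name `tempdir_with_credentials` and the key values are hypothetical. Secrets containing a slash are commonly reported to break this URI form, so the sketch refuses them rather than guessing at an escaping rule; that check is my assumption, not something the source states.

```python
def tempdir_with_credentials(access_key, secret_key, bucket, path):
    """Embed AWS credentials in an s3n tempdir URI (hypothetical helper)."""
    if "/" in secret_key:
        # Assumption: a slash in the secret makes the URI ambiguous to parse
        raise ValueError("secret contains '/'; set the keys in the Hadoop conf instead")
    return f"s3n://{access_key}:{secret_key}@{bucket}/{path.lstrip('/')}"


# Placeholder credentials for illustration only
uri = tempdir_with_credentials("EXAMPLEKEY", "examplesecret", "bucket", "dir")
```

The resulting string would be passed as the `tempdir` option. Embedding credentials in a URI also means they can show up in logs, which is another reason to prefer setting them in the Hadoop configuration.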

4. java.lang.NoClassDefFoundError: com/eclipsesource/json/Json

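Next, once the AWS key and secret are configured, you may hit this error. At first glance it looks odd: why would a JSON class be missing? In the spark-redshift issues I found someone with the same problem; at the bottom of that thread, arvindkanda gives the solution, which is to supply one extra jar at startup.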
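Supplying extra jars at startup could look like the sketch below, which only assembles a `spark-submit` command line. The helper name, script name, and jar paths are all hypothetical. The missing `com.eclipsesource.json` package is the one shipped by the minimal-json library, so that is the jar assumed here; the source does not name the jar.

```python
def spark_submit_command(script, jars):
    """Assemble a spark-submit argv list; --jars takes a comma-separated list."""
    return ["spark-submit", "--jars", ",".join(jars), script]


# Hypothetical jar locations; adjust to wherever the jars actually live
cmd = spark_submit_command(
    "load_redshift.py",
    [
        "/path/to/RedshiftJDBC42.jar",   # Redshift JDBC driver
        "/path/to/spark-redshift.jar",   # wrapper library
        "/path/to/minimal-json.jar",     # provides com.eclipsesource.json.Json
    ],
)
```

The list can then be handed to `subprocess.run(cmd, check=True)` or joined with spaces for a shell.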

5. java.sql.SQLException: Amazon Invalid operation: S3ServiceException:The S3 bucket addressed by the query is in a different region from this cluster.

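This error means the S3 bucket and the cluster must be in the same region; otherwise Spark cannot read the Redshift data. The exception is returned by Redshift over JDBC, so "this cluster" most likely refers to the Redshift cluster, although I keep EMR in the same region as well. Everything I use here is in us-west-2 (Oregon).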
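A quick pre-flight check for the region mismatch might look like this. It is a sketch under assumptions: it relies on the usual Redshift endpoint shape `<cluster>.<id>.<region>.redshift.amazonaws.com`, the helper names and the example host are hypothetical, and the bucket region is supplied by the caller rather than looked up.

```python
def redshift_region(endpoint_host):
    """Extract the region from a Redshift endpoint host name.

    Assumes the form <cluster>.<id>.<region>.redshift.amazonaws.com.
    """
    parts = endpoint_host.split(".")
    if len(parts) < 6 or parts[-3:] != ["redshift", "amazonaws", "com"]:
        raise ValueError(f"unexpected Redshift endpoint: {endpoint_host}")
    return parts[-4]


def same_region(endpoint_host, bucket_region):
    """True when the cluster and the tempdir bucket share a region."""
    return redshift_region(endpoint_host) == bucket_region


# Hypothetical endpoint for illustration only
host = "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com"
ok = same_region(host, "us-west-2")
```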

6. com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;

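This one was the toughest and cost me several hours. Every solution I found online said the cause is the signature version, so the S3 endpoint must be specified when accessing S3. All of those solutions were for s3a, but spark-redshift uses s3n, so I replaced the a with n; the problem persisted. I kept trying different combinations and, perhaps by luck, one of them worked: switch from the approach in item 3 to approach ii, then set the endpoint with sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com"). Note that the final code below sets the fs.s3n.endpoint key, which matches the s3n scheme used for tempdir.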
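The combination that worked (keys in the Hadoop configuration plus a region-specific endpoint) can be sketched as a pair of small helpers. The property names come from this post's final code; the function names and the `RecordingConf` stand-in are hypothetical and exist only so the sketch can be dry-run without a Spark session.

```python
def s3n_settings(access_key, secret_key, region):
    """Hadoop properties for s3n access with a region-specific endpoint."""
    return {
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
        "fs.s3n.endpoint": f"s3.{region}.amazonaws.com",
    }


def apply_settings(hadoop_conf, settings):
    """Copy each property onto any object exposing set(key, value)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


class RecordingConf:
    """Stand-in for a Hadoop Configuration, used only for a dry run."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = RecordingConf()
apply_settings(conf, s3n_settings("EXAMPLEKEY", "examplesecret", "us-west-2"))
```

In a real job the first argument would be `spark._jsc.hadoopConfiguration()` instead of the stand-in.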

The final code is as follows:

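import os

from pyspark.sql import SparkSession

# Assumption: credentials come from the environment; the original left these names undefined
aws_access_key_id = os.environ['AWS_ACCESS_KEY_ID']
aws_secret_access_key = os.environ['AWS_SECRET_ACCESS_KEY']

spark = SparkSession.builder.getOrCreate()

# Approach ii: set the S3 keys and the regional endpoint in the Hadoop configuration
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3n.awsAccessKeyId', aws_access_key_id)
hadoop_conf.set('fs.s3n.awsSecretAccessKey', aws_secret_access_key)
hadoop_conf.set('fs.s3n.endpoint', 's3.us-west-2.amazonaws.com')

# host, port, database, table and credentials below are placeholders
rsdf = (
    spark.read
    .format('com.databricks.spark.redshift')
    .option('url', 'jdbc:redshift://host:port/database')
    .option('dbtable', 'table_name')
    .option('user', 'username')
    .option('password', 'password')
    .option('tempdir', 's3n://bucket/dir')  # bucket must be in the cluster's region
    .load()
)
# Print the table schema
rsdf.printSchema()
# Print the table contents
rsdf.show()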

The Spark launch command parameters were covered in an earlier article, so I won't repeat them here.
