Spark Streaming windows testing: a record of exceptions encountered

--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -verbose:gc -XX:+UseG1GC -Xloggc:gc.log" \
--conf 'spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -verbose:gc -XX:+UseG1GC -Xloggc:gc.log' \

1. Missing spark-streaming-kafka assembly jar:


Solution: download the matching jar from http://search.maven.org, for example spark-streaming-kafka-0-8-assembly_2.11-2.4.5.jar, and place it in the jars directory of the pyspark package under site-packages. On my machine that is /usr/lib/python2.7/site-packages/pyspark/jars (the Python interpreter is /usr/bin/python2.7).
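If you are unsure which jars directory your PySpark installation actually loads from, a quick way to check from the interpreter (a small sketch; nothing here is specific to this project):

    import os
    import pyspark
    # The jars bundled with a site-packages PySpark install live next to the package itself.
    print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))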

2. Permission denied when writing to HDFS:

org.apache.hadoop.security.AccessControlException: Permission denied: user=angela, access=WRITE,
inode="/user/angela/checkpoint/sparkstreaming_windows_31229":hdfs:hdfs:drwxr-xr-x

Solution: ask the colleague who administers HDFS to create a /user/angela directory and grant it the appropriate permissions.

3. Job generation fails because the input pattern matches 0 files:

20/04/01 14:26:30 ERROR scheduler.JobScheduler: Error generating jobs for time 1585722390000 ms
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://CQBAK/user/etl/SHHadoopStream/BulkZip/{} matches 0 files
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
    at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
    at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:51)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)

 

Solution: add an rdd.isEmpty() check at the top of the trans method:

def trans(rdd):
    sc = rdd.context
    # Guard against empty batches: calling binaryFiles() with a pattern that
    # matches no files raises InvalidInputException, so return early.
    if rdd.isEmpty():
        return rdd

    bulkZipEventList = rdd.collect()
    # generateHDFSPathListFromFileList is a project helper that turns the
    # collected event list into an HDFS path pattern.
    hdfsPath = "hdfs:///user/etl/HadoopStream/HaHa/%s" % (generateHDFSPathListFromFileList(bulkZipEventList))
    rdd_zip_files = sc.binaryFiles(hdfsPath, 128)
    return rdd_zip_files
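For context, this is roughly how such a helper is wired into the streaming pipeline through transform(); the DStream name bulkZipDS below is hypothetical and not taken from the original job:

    # transform() invokes trans on the driver once per batch; returning the empty
    # RDD unchanged means binaryFiles() is never called with a pattern that matches nothing.
    rdd_zip_files_ds = bulkZipDS.transform(trans)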

4. The streaming job calls finallyDS.foreachRDD(getCurrView). Inside getCurrView the SparkContext is obtained from rdd.context, and a SQLContext and a HiveContext are then created from it; the HiveContext is used to read data from Hive. hive-site.xml is present in the project directory, and the Hive metastore is also configured on the SparkConf with sconf.set("hive.metastore.uris", "thrift://s3.hdp.com:9083"), yet the program still fails with:

pyspark.sql.utils.AnalysisException: u"Table or view not found: `dtd_ods`.`stn_history`; line 1 pos 14;\n'GlobalLimit 1\n+- 'LocalLimit 1\n   +- 'Project [*]\n      +- 'UnresolvedRelation `dtd_ods`.`stn_history`\n"

Solution: after much trial and error, it turns out the HiveContext must be created before the SQLContext. Code:

    sc = rdd.context
    hiveContext = HiveContext(sc)
    sqlContext = SQLContext(sc)
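For reference, a minimal sketch of a foreachRDD handler that follows this ordering; the table name comes from the error above, while the query and variable names are only illustrative:

    from pyspark.sql import HiveContext, SQLContext

    def getCurrView(rdd):
        sc = rdd.context
        # Create the HiveContext first so the catalog backed by the Hive
        # metastore is the one tables get resolved against.
        hiveContext = HiveContext(sc)
        sqlContext = SQLContext(sc)
        df = hiveContext.sql("select * from dtd_ods.stn_history limit 1")
        df.show()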

7. Py4JJavaError: An error occurred while calling o247.showString.
: java.lang.AssertionError: assertion failed: No plan for HiveTableRelation `dtd_ods`.`stn_history`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [servicetag#236, testitem#237, station#238, test_result#239, uut_state_message#240, stn_timestamp#241, uut_state_timestamp#242, last_uut_state_timestamp#243, last_time#244, product_name#245, site_code#246, localization#247, operation_shift#248, gbu#249, product_line#250, date_datetime#251, create_timestamp#252], [day#253]

 

 

8. Py4JJavaError: An error occurred while calling o254.collectToPython.
: java.lang.IllegalStateException: table stats must be specified.
    at org.apache.spark.sql.catalyst.catalog.HiveTableRelation$$anonfun$computeStats$2.apply(interface.scala:629)
    at org.apache.spark.sql.catalyst.catalog.HiveTableRelation$$anonfun$computeStats$2.apply(interface.scala:629)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.catalyst.catalog.HiveTableRelation.computeStats(interface.scala:628)
    at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
    at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor

10. NoClassDefFoundError for jersey classes when submitting to YARN:

20/04/08 15:38:17 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://s1:4040

Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

            at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)

            at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createTimelineClient(YarnClientImpl.java:181)

            at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:168)

            at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)

            at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:161)

            at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBac

Solution: copy jersey-client-1.9.jar and jersey-core-1.9.jar from /usr/hdp/2.6.3/hadoop-yarn/lib into /usr/lib/python2.7/site-packages/pyspark/jars, and rename the existing jersey-client-2.22.2.jar in that jars directory to jersey-client-2.22.2.jar.backup.

Reference: https://my.oschina.net/xiaozhublog/blog/737902

11. Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider.getProxyInternal()Ljava/lang/Object; from class org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider
    at org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider.init(RequestHedgingRMFailoverProxyProvider.java:75)
    at org.apache.hadoop.yarn.client.RMProxy.createRMFailoverProxyProvider(RMProxy.java:163)
    at org.apache.hadoop.yarn.client.RMProxy.createRMProxy(RMProxy.java:94)
    at org.apache.hadoop.yarn.client.ClientRMProxy

Solution: change

yarn.client.failover-proxy-provider=org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider

to

yarn.client.failover-proxy-provider=org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider

Reference: https://www.w3xue.com/exp/article/20202/74219.html
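Besides editing the YARN client configuration file directly, Hadoop keys can usually also be overridden from the Spark side through the spark.hadoop.* prefix. A small sketch, assuming that override path works in your deployment (sconf is the SparkConf name already used by this job):

    from pyspark import SparkConf

    sconf = SparkConf()
    # spark.hadoop.* keys have the prefix stripped and are injected into the
    # Hadoop Configuration that the YARN client is built from.
    sconf.set("spark.hadoop.yarn.client.failover-proxy-provider",
              "org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider")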

12. In the Python terminal: ImportError: ('No module named …

Solution: register the dependent Python files on the SparkContext:

    sc.addPyFile('./ParseNewTest.py')
    sc.addPyFile('./Parseommon.py')
    sc.addPyFile('./Parseh.py')

13. The program fails when recovering from the checkpoint directory: ImportError: ('No module named …

org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/mnt/hadoop/local/usercache/hive/appcache/application_1587440040048_0489/container_e149_1587440040048_0489_01_000003/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/mnt/hadoop/local/usercache/hive/appcache/application_1587440040048_0489/container_e149_1587440040048_0489_01_000003/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt/hadoop/local/usercache/hive/appcache/application_1587440040048_0489/container_e149_1587440040048_0489_01_000003/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/mnt/hadoop/local/usercache/hive/appcache/application_1587440040048_0489/container_e149_1587440040048_0489_01_000003/pyspark.zip/pyspark/cloudpickle.py", line 664, in subimport
    __import__(name)
ImportError: ('No module named ParseBulkpper', <function subimport at 0x8e3a28>, ('ParseBulkZirapper',))
Description: the program runs fine the first time it is launched; on the second launch, when it recovers from the checkpoint directory, it fails because the module cannot be found.

Solution: add --files to the spark-submit command. This option (--files FILES) takes a comma-separated list of files, and those files are placed in the working directory of every worker process.

 
