Questions tagged [hadoop]

Hadoop is an Apache open-source project that provides software for reliable and scalable distributed computing. The core consists of a distributed file system (HDFS) and a resource manager (YARN). Various other open-source projects, such as Apache Hive, use Apache Hadoop as a persistence layer.

0
votes
0 answers
30 views

Kerberos error while connecting to Cloudera Impala environment

While connecting to a kerberized Hadoop environment I get this error: [Simba]ImpalaJDBCDriver Unable to connect to server: [Simba]ImpalaJDBCDriver Kerberos Authentication failed. I've installed cloudera ...
0
votes
0 answers
26 views

start-dfs.sh throws "port 22: connection timed out" error

I am trying to install Hadoop on Ubuntu in a pseudo-distributed environment. start-dfs.sh gives me an error: Starting namenodes on [10.1.37.12] 10.1.37.00: ssh: connect to host 10.1.37.12 port 22: ...
0
votes
1 answer
15 views

How to fix this fatal error while running Spark jobs on an HDInsight cluster? Session 681 unexpectedly reached final status 'dead'. See logs:

I am running PySpark code on an HDInsight cluster and getting this error: The code failed because of a fatal error: Session 681 unexpectedly reached final status 'dead'. See logs: I don't have experience ...
0
votes
0 answers
20 views

Code is failing with a NullPointerException because the file has blank lines and a header

I am writing a MapReduce program for the problem statement below. Because the file has blank lines and a header, the code is failing with a NullPointerException. Can you please check my code and let me know ...
1
vote
1 answer
16 views

Do Avro and Parquet formatted data have to be written within a Hadoop infrastructure?

I've been researching the pros and cons of using Avro, Parquet, and other data sources for a project. If I am receiving input data from other groups of people who do not operate using Hadoop, will ...
-2
votes
1 answer
14 views

Cannot create directory on Hadoop through the Hadoop web console

I have set up a Hadoop environment using a Linux VMware image. I am able to create files and folders using the Linux terminal, but when I use the web interface to do the same, I get the error: Permission ...
0
votes
0 answers
9 views

Hadoop MapReduce Java

I'm trying to learn Hadoop, and there is an example like the one below in the documentation. I can't understand what these parameters mean. Please help me understand the map and reduce methods. I read ...
0
votes
1 answer
15 views

Issue while passing a parameter to HQL

Error: bash: line 1: syntax error near unexpected token `(' while passing varchar(16) as a parameter in HQL. hive --hivevar id_variable_type="${id_variable_type}" -f $HIVE_SCRIPT_DIR/tds_validation....
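For context on the error above: an unquoted ( is a bash syntax error inside an assignment, which is why passing varchar(16) without quotes breaks before Hive ever runs; quoting the value fixes it. A minimal reproduction, driving bash from Python (the variable name is taken from the question):

```python
import subprocess

# Unquoted parentheses are parsed by bash itself -> syntax error.
bad = subprocess.run(["bash", "-c", "id_variable_type=varchar(16)"],
                     capture_output=True, text=True)

# Quoting hides the parentheses from the shell parser.
good = subprocess.run(
    ["bash", "-c", 'id_variable_type="varchar(16)"; echo "$id_variable_type"'],
    capture_output=True, text=True)

print(bad.returncode)        # non-zero: the unquoted form fails to parse
print(good.stdout.strip())   # varchar(16)
```

The same applies to the outer command: quote the value when exporting it, e.g. id_variable_type="varchar(16)" before invoking hive --hivevar.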
0
votes
0 answers
17 views

Kerberos java to impala keytab authentication with JAAS Configuration

I am trying to connect to Impala DB using kerberos KeyTab authentication and JAAS configuration. I am already able to connect to Impala DB by setting System property - "java.security.auth.login....
0
votes
0 answers
13 views

In my Hadoop project I have set the number of reduce tasks to 0 via “job.setNumReduceTasks(0)”, but there is still a reduce task on the job tracker page

I wrote a map-only Hadoop project, but a reduce task is still started. Job job = Job.getInstance(cfg); // specify the local path of this program's jar and submit the jar to YARN job.setJarByClass(WordcountDriver.class); ...
0
votes
0 answers
14 views

Writing MapReduce output with a custom file name prefix to Amazon S3

Output folders with names other than the defaults are not created in the S3 bucket. The reducer program uses org.apache.hadoop.mapreduce.lib.output.MultipleOutputs to modify the reducer output file names, ...
1
vote
1 answer
14 views

How does a data read happen in HBase?

We know HBase is deployed on top of Hadoop and HDFS. We also know that when we want to read a file (or record) from HDFS, it takes a considerable amount of time using the HDFS CLI. But even though HBase uses HDFS, ...
0
votes
1 answer
13 views

Ambari Agent Registration failed due to unsupported OS type

I was using Ambari Server UI to register a node as an agent and the registration kept failing. I checked the ambari-agent logs at /var/log/ambari-agent/ and found the following line in the logs ...
0
votes
1 answer
21 views

Why did my throughput and average IO rate get slower when I added a node to my Hadoop cluster?

So I ran TestDFSIO on my cluster to see the throughput and average IO rate of read and write operations. I ran 4 tests: 4 files 256 MB each (total 1 GB) 2 files 256 MB each (total 512 MB) 2 files 128 MB ...
1
vote
1 answer
32 views

How do you get the driver and executors to load and recognize the postgres driver in EMR with spark-submit?

BACKGROUND I am trying to run a spark-submit command that streams from Kafka and performs a JDBC sink into a postgres DB in AWS EMR (version 5.23.0) and using scala (version 2.11.12). The errors I see ...
1
vote
1 answer
22 views

Required executor memory is above the max threshold of this cluster

I am running Spark on an 8 node cluster with yarn as a resource manager. I have 64GB memory per node, and I set the executor memory to 25GB, but I get the error: Required executor memory (25600MB) is ...
0
votes
0 answers
14 views

Unable to import data into Hive from SQL Server

I am trying to import a table from SQL Server to Hive, but it is giving the error below: ERROR tool.ImportTool: Import failed: There is no column found in the target table COLABORA. Please ensure that ...
-1
votes
1 answer
22 views

pandas cumcount in PySpark

I am currently attempting to convert a script I made from pandas to PySpark. I have a dataframe that contains data in the form: index | letter ------|------- 0 | a 1 | a 2 | b 3 | c 4 ...
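In PySpark, the usual translation of pandas groupby(...).cumcount() is row_number() over a Window partitioned by the group column, minus one. As a plain-Python sketch of the semantics being reproduced (letters taken from the question's sample column):

```python
from collections import defaultdict

def cumcount(values):
    """For each element, how many times it has been seen before --
    the semantics of pandas groupby(col).cumcount()."""
    seen = defaultdict(int)
    out = []
    for v in values:
        out.append(seen[v])
        seen[v] += 1
    return out

letters = ["a", "a", "b", "c", "a"]
print(cumcount(letters))  # [0, 1, 0, 0, 2]
```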
0
votes
0 answers
9 views

How to fix “Version information is not found in metastore” in Sqoop

I'm using Sqoop import to copy MySQL tables to Hive. The following command manages to import the tables correctly to Hive: sqoop import \ --connect jdbc:mysql://localhost:3301/dbname \ --username ...
-1
votes
0 answers
24 views

How to split a dataframe based on column value with identifier in the same order

I have been looking for a solution on how to split a dataframe based on column values with the identifier #@@#. I am getting the expected result when the combination is a pair of two, i.e. srcip#@@#destip or ...
0
votes
1 answer
12 views

Can a Hadoop slave node be made the Hadoop master node without incurring data loss?

I have a three-node Hadoop cluster: A (master), B (slave), C (slave). If I want to make C the master and the remaining nodes slaves, will I face data loss or data corruption?
2
votes
1 answer
43 views

Does Hive preserve file order when selecting data?

If I do select * from table1;, in which order will the data be retrieved: file order or random order?
-1
votes
0 answers
22 views

Copying text file from Downloads

I get this error when I try to copy a downloaded text file (words.txt). I tried different syntax, but it's not successful: hadoop fs -copyFromLocal words.txt -copyFromLocal: Unknown command
0
votes
0 answers
22 views

authentication error when trying to access WebHDFS

I'm trying to access webhdfs in a hadoop cluster running Cloudera using the following command !curl -i --negotiate -u : "http://namenode_address:50070/webhdfs/v1/user/?op=LISTSTATUS" and I'm getting ...
0
votes
0 answers
18 views

Why does a map task write its output to disk in MapReduce?

I heard that a map task persists its data onto disk. But this makes MapReduce slower, especially for iterative algorithms. Why do we want to persist the intermediate output to disk? Why don't we ...
0
votes
0 answers
7 views

How to determine the number of requests/connections going to the Hive Metastore database from HMS?

We have a couple of configurations to limit the number of worker threads in Hive Metastore Server, as below: hive.metastore.server.max.threads hive.metastore.server.min.threads Source: https://cwiki....
-1
votes
0 answers
10 views

"Can not create a Path from a null string" with the copyFromLocal command

I have an emp.txt file on my local system: C:\Users\jmani\Desktop\emp.txt. I want to copy it from local to an HDFS location. I tried the command: hadoop fs -copyFromLocal C:\Users\jmani\Desktop\...
0
votes
2 answers
30 views

Spark connecting to Hive on HDFS vs. Spark connecting to HDFS directly with Hive on top of it?

Summary of the problem: I have a particular use case to write >10 GB of data per day to HDFS via Spark streaming. We are currently in the design phase. We want to write the data to HDFS (constraint) using ...
-2
votes
0 answers
26 views

What is the advantage of using external tables in Hive?

I know that data in an external table is not controlled by Hive, and that the data isn't removed when we drop an external table. Apart from this, why exactly do we use external tables in real-world projects?
0
votes
1 answer
15 views

How can we limit the usage of VCores during spark-submit?

I am writing a Spark Structured Streaming application in which data processed with Spark needs to be sinked to an S3 bucket. This is my development environment: Hadoop 2.6.0-cdh5.16.1 Spark version 2.3....
0
votes
0 answers
24 views

How to encrypt the AK and SK in core-site.xml when linking to s3a using the Livy REST API

I am using the Livy REST API to submit Spark jobs, using s3a to replace HDFS. I write the AK and SK directly in core-site.xml, and it works. I want to encrypt my AK and SK, as I don't want others to know my SK. ...
0
votes
2 answers
32 views

How to check if an HDFS directory is empty in Spark

I am using org.apache.hadoop.fs to check whether a directory in HDFS is empty or not. I looked up the FileSystem API but couldn't find anything close to it. Basically, I want to check if the directory is ...
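The check being asked for reduces to "does listing the path return zero entries". On HDFS that predicate is typically whether FileSystem.listStatus(path) returns an empty array (via the Hadoop Java API). Since that needs a cluster, here is a sketch of the same predicate against a local directory, purely to show the logic:

```python
import os
import tempfile

def is_empty_dir(path):
    """A directory is 'empty' when listing it yields no entries --
    the same predicate listStatus() gives you on HDFS."""
    return len(os.listdir(path)) == 0

with tempfile.TemporaryDirectory() as d:
    print(is_empty_dir(d))                            # True
    open(os.path.join(d, "part-00000"), "w").close()
    print(is_empty_dir(d))                            # False
```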
-1
votes
1 answer
43 views

I am trying to print just the size and basename

This script prints out the size of the database in GB and also the path of the database. $1/1024/1024/1024 shows the size in GB and $3 prints the path. I added the output needed and what I am ...
0
votes
0 answers
20 views

How to aggregate and show the top n items with a MapReduce job

Every line of my data is a string concatenation of a year and some characters. I want to get the top 3 characters, ordered by descending frequency, for each year. My input is like: 2012,A,B,C 2000,C,...
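The aggregation described above can be sketched outside Hadoop: a map phase emitting (year, character) pairs and a reduce phase counting per year, then keeping the 3 most frequent. The sample lines below are completed illustratively from the truncated input in the question:

```python
from collections import Counter, defaultdict

lines = ["2012,A,B,C", "2000,C,A", "2012,A,C", "2000,C,B"]

# "map": emit (year, char) pairs; "reduce": count chars per year.
counts = defaultdict(Counter)
for line in lines:
    year, *chars = line.split(",")
    counts[year].update(chars)

# Keep the 3 most frequent characters for each year.
top3 = {year: [c for c, _ in ctr.most_common(3)] for year, ctr in counts.items()}
print(top3)
```

In real MapReduce the per-year top-3 selection would typically happen in the reducer (one reducer call per year key), exactly as most_common(3) does here.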
0
votes
2 answers
38 views

Why is it that SUM(a + b) != SUM(a) + SUM(b) in Hive?

I'm running Hive 1.1.0 and see that for two bigint columns, active_users and inactive_users, SUM(active_users + inactive_users) < SUM(active_users) + SUM(inactive_users). Why is that the case, ...
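A common cause of the inequality asked about above is NULL handling: in SQL, a + b is NULL when either operand is NULL, while SUM() silently skips NULLs, so rows where exactly one of the two columns is NULL drop out of SUM(a + b) but still contribute to SUM(a) + SUM(b). A plain-Python simulation of those semantics (sample rows are illustrative):

```python
rows = [(10, 5), (7, None), (None, 3)]  # None stands in for SQL NULL

def sql_sum(values):
    """SUM() ignores NULLs."""
    vals = [v for v in values if v is not None]
    return sum(vals) if vals else None

def sql_add(a, b):
    """a + b is NULL when either side is NULL."""
    return None if a is None or b is None else a + b

lhs = sql_sum(sql_add(a, b) for a, b in rows)                     # only (10, 5) survives
rhs = sql_sum(a for a, _ in rows) + sql_sum(b for _, b in rows)   # 17 + 8
print(lhs, rhs)  # 15 25
```

(Bigint overflow is another possible cause for huge values, but mismatched NULLs are the usual culprit.)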
0
votes
1 answer
7 views

What does “moveToLocal: Option '-moveToLocal' is not implemented yet.” mean?

I'm running an Oozie workflow with some bash scripts in a Hadoop environment (Hadoop 2.7.3), but my workflow is failing because my shell action gets an error. After saving the commands' output in a file as ...
0
votes
0 answers
19 views

How to identify the disk space consumed by a particular directory pattern using an hdfs command, without listing all files under that directory?

How can I identify the disk space consumed by a particular directory pattern using an hdfs command, without listing all the files under that directory? How can the hdfs dfs -du -h command be clubbed efficiently with ...
0
votes
0 answers
8 views

How to identify the disk usage of a particular directory pattern using an hdfs command without listing all files?

How can I identify the disk usage of a particular directory pattern using an hdfs command without listing all files? Or how can it be clubbed with the hdfs dfs -du -h command? Example: hdfs dfs -du -h /data//...
0
votes
0 answers
15 views

How to restart a Spark job when it fails with a non-zero exit status

I'm trying to figure out how to restart a Spark job when it fails with a non-zero exit status, like a database connection exception or any other runtime exception. From the apache-spark documentation, ...
1
vote
1 answer
23 views

MapReduce with 2 values

Is it possible to have two values in MapReduce? My CSV looks like this: month, date, deviceCategory, totalTransactionRevenue 201608 20160801 Desktop 1000 201608 20160801 Mobile ...
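One standard answer to the question above is to emit a composite value (in Java MapReduce, a custom Writable or a delimited string) and aggregate each component separately in the reducer. A sketch of that idea in plain Python, with rows modeled loosely on the question's CSV (the third row is an illustrative addition):

```python
from collections import defaultdict

# month, date, deviceCategory, totalTransactionRevenue
rows = [
    ("201608", "20160801", "Desktop", 1000),
    ("201608", "20160801", "Mobile", 500),
    ("201608", "20160802", "Desktop", 250),
]

# "map": key = (month, deviceCategory), value = (revenue, 1)
# "reduce": element-wise sum -> total revenue and a record count.
totals = defaultdict(lambda: (0, 0))
for month, _, device, revenue in rows:
    rev, cnt = totals[(month, device)]
    totals[(month, device)] = (rev + revenue, cnt + 1)

print(totals[("201608", "Desktop")])  # (1250, 2)
```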
0
votes
1 answer
31 views

Which dependency should I add to read a txt file in S3 with Scala Spark using IntelliJ?

I'm using the IntelliJ IDE and the Scala language. I want to access a text file stored in AWS S3 using IAM user credentials. I have not downloaded Hadoop on my system; I am just using the dependencies. I have done ...
1
vote
1 answer
13 views

Use a hyphen in an Impala database name

I have a script which requires the creation of Impala databases using a hyphen in the database name. I am not able to do the same in the Impala shell using the below command. ******** default> ...
-3
votes
1 answer
30 views

Is there a way to share/access HDFS among developers?

I'm new to big data and Hive. I need to work with another developer on a Spark streaming app, which involves reading from Kafka and placing the data on Hive/HDFS. The other developer uses/points to the same ...
0
votes
1 answer
22 views

Where can I find a directory I have created using hadoop fs -mkdir in my Ubuntu file system?

I am new to Hadoop. I have created a directory using hadoop fs -mkdir -p /user/vinayak. Where is this folder located in my file system? I am using Ubuntu 18.04. I am new to Hadoop and unfamiliar with ...
0
votes
0 answers
11 views

Log in to Hadoop from a Java program

I need to delete a directory on HDFS recursively. I have used FileSystem to read, and it worked fine. But while deleting, the console prints "access is denied", as the configuration that I am using fetches the ...
1
vote
3 answers
41 views

Regular expression - only include 0 if in 2nd position of x.x.x

I am trying to figure out how to write a regular expression for a string of the format xx.xx.xx (but sometimes the third component is not included). For example, the strings could be: 12.1 12.1.0 14.5....
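For the two-or-three-component version strings shown (12.1, 12.1.0, ...), one candidate pattern, assuming the third numeric group is simply optional (the question's exact rule about 0 in the second position is not fully stated), is:

```python
import re

# "major.minor" with an optional ".patch" group -- an assumption
# based on the examples 12.1 and 12.1.0 in the question.
VERSION = re.compile(r"^\d+\.\d+(?:\.\d+)?$")

for s in ["12.1", "12.1.0", "14.5.3", "12.", "a.b.c"]:
    print(s, bool(VERSION.fullmatch(s)))
```

The non-capturing group (?:\.\d+)? makes the entire third component optional, so a trailing bare dot like "12." is rejected.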
0
votes
1 answer
7 views

Migrating existing metadata from the metastore (Derby) and data from Hive 1.2 to Hive 2.4.3

I have freshly deployed Hive 2.4.3; however, there are a few existing tables with partitions on the older Hive 1.2. I am using Derby as the metadata store. What is the best way to migrate them to ...
0
votes
0 answers
21 views

Hive remote Postgres metastore

I was doing a multi-node setup using the Apache distribution. I was able to complete the Hadoop installation successfully (Hadoop 2.7.3). When I tried Hive (Hive 2.3), it works without issues with the default ...
2
votes
0 answers
17 views

PySpark error on reading Parquet files stored in HDFS: BlockMissingException

I have data stored in Parquet format on HDFS which I want to process using Spark. Platform: Ubuntu 16.04, Spark 2.1.3, Hadoop 2.6.5. Here is a listing of the directory contents where the data is stored: ...
0
votes
0 answers
16 views

How to use filter conditions with the SHOW PARTITIONS clause in Hive?

I have a Hive table which is partitioned by date, app_name, and src (3 partition columns). I want to run the show partitions command in multiple ways, like the following: // works show partitions mydb.tab_dt ...