Fundamentals of Big Data Technology: Lab 3

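  1. Merging and deduplicating files
Given two input files, file A and file B, write a MapReduce program that merges the two files and removes duplicate content, producing a new output file C. A sample of the input and output files is provided for reference.
Python code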
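#!/usr/bin/env python3
# Usage: python mapred1.py <datanode-host> <hdfs-path-A> <hdfs-path-B>

import sys
import os

argv = sys.argv
# Local file names are the last path component of each HDFS path.
file1 = argv[2].split("/")[-1]
file2 = argv[3].split("/")[-1]

# Download both files from the DataNode's streamFile endpoint.
# The URL is quoted so the shell leaves '?' alone; -q silences wget.
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[2], file1))
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[3], file2))

def read_lines(path):
    # splitlines() avoids the empty trailing element that split('\n') produces.
    with open(path, 'r') as f:
        return f.read().splitlines()

A = read_lines(file1)
B = read_lines(file2)
C = []        # merged lines, in first-seen order
seen = set()  # lines already in C, for fast duplicate checks

print(A)
print(B)

def m(x):
    # Keep non-empty lines that have not been seen yet.
    if x and x not in seen:
        print("Appending : " + x)
        seen.add(x)
        C.append(x)

for line in A + B:
    m(line)

print("result after map:")
for x in C:
    print(x)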
Run command
python mapred1.py 192.168.43.206 /A.txt /B.txt
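The exercise asks for a MapReduce program, while the script above merges the files in memory on one machine. As a rough sketch of how the same job splits into map, shuffle, and reduce steps, the following simulates the three phases locally. The function names and the sample lines are hypothetical and are not taken from the assignment.

```python
from itertools import groupby

def map_phase(lines):
    """Emit (line, None) for every non-empty line; the line itself is the key."""
    for line in lines:
        line = line.strip()
        if line:
            yield line, None

def reduce_phase(pairs):
    """Sort by key (standing in for the shuffle), then emit each distinct key once."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for key, _group in groupby(ordered, key=lambda kv: kv[0]):
        yield key

def merge_dedup(*files):
    """Run the map phase over every input, then reduce the combined pairs."""
    pairs = [kv for lines in files for kv in map_phase(lines)]
    return list(reduce_phase(pairs))

# Hypothetical sample data standing in for files A and B.
file_a = ["20150101 x", "20150102 y", "20150103 x"]
file_b = ["20150101 y", "20150102 y", "20150103 x"]

result = merge_dedup(file_a, file_b)
for line in result:
    print(line)
```

Because the shuffle step sorts by key, this sketch returns the merged lines in sorted order, whereas the script above keeps them in first-seen order.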
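  2. Sorting the input files
There are several input files, and every line of every file contains one integer. Read the integers from all files, sort them in ascending order, and write them to a new file. Each output line holds two integers: the first is the rank of the second in the sorted order, and the second is the original integer. A sample of the input and output files is provided for reference.
Python code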
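#!/usr/bin/env python3
# Usage: python mapred2.py <datanode-host> <hdfs-path-1> <hdfs-path-2> <hdfs-path-3>

import sys
import os

argv = sys.argv
# Local file names are the last path component of each HDFS path.
file1 = argv[2].split("/")[-1]
file2 = argv[3].split("/")[-1]
file3 = argv[4].split("/")[-1]

# Download the three files from the DataNode's streamFile endpoint.
# The URL is quoted so the shell leaves '?' alone; -q silences wget.
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[2], file1))
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[3], file2))
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[4], file3))

def sp(x):
    # Return the non-blank lines of one file, whether or not it ends with a newline.
    with open(x, 'r') as f:
        return [line for line in f.read().splitlines() if line.strip()]

A, B, C = map(sp, [file1, file2, file3])

print(A)
print(B)
print(C)

# Convert every line to an integer and sort in ascending order.
D = sorted(int(x) for x in A + B + C)

print("result after map:")
# Print "<rank> <value>", with ranks starting at 1.
for rank, value in enumerate(D, start=1):
    print(str(rank) + ' ' + str(value))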
Run command
python mapred2.py 192.168.43.206 /3-2-1.txt /3-2-2.txt /3-2-3.txt
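The script above sorts everything in one process. As a minimal sketch of the same task in map and reduce terms, assuming a single reducer that receives all keys in sorted order, the mapper can emit each integer as a key and the reducer can number the keys as they arrive. The function names and the sample values are hypothetical.

```python
def map_phase(lines):
    """Emit one integer key per non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)

def reduce_phase(keys):
    """Receive the keys in ascending order and attach a running rank to each."""
    rank = 0
    for key in sorted(keys):  # sorted() stands in for the shuffle/sort step
        rank += 1
        yield rank, key

def rank_sort(*files):
    """Map every input file, then rank the combined keys in one reducer."""
    keys = [k for lines in files for k in map_phase(lines)]
    return ["{} {}".format(rank, value) for rank, value in reduce_phase(keys)]

# Hypothetical sample data standing in for the three input files.
sample1 = ["33", "37", "12", "40"]
sample2 = ["4", "16", "39", "5"]
sample3 = ["1", "45", "25"]

ranked = rank_sort(sample1, sample2, sample3)
for line in ranked:
    print(line)
```

As in the script above, duplicate integers each receive their own rank. With more than one reducer, the keys would also need a range partitioner so that the reducers' outputs concatenate into one globally sorted file.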
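  3. Mining information from a given table
Given a child-parent table, mine the parent-child relationships it contains and produce a table of grandchild-grandparent relationships.
Python code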
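#!/usr/bin/env python3
# Usage: python mapred3.py <datanode-host> <hdfs-path>

import sys
import os

argv = sys.argv
# Local file name is the last path component of the HDFS path.
file1 = argv[2].split("/")[-1]

# Download the table from the DataNode's streamFile endpoint.
# The URL is quoted so the shell leaves '?' alone; -q silences wget.
os.system('wget -q "http://{}:50075/streamFile{}?nnaddr=127.0.0.1:9000" -O {}'.format(argv[1], argv[2], file1))

# Skip the header row; splitlines() leaves no empty trailing element.
with open(file1, 'r') as f:
    A = f.read().splitlines()[1:]

par = {}  # child -> list of that child's parents

def m(x):
    # Parse one "child<TAB>parent" row and record the relation.
    kid, parent = x.split('\t')
    par.setdefault(kid, []).append(parent)

for row in A:
    if row:  # ignore blank lines
        m(row)

kids = list(par.keys())

def getparent(x):
    # Parents of x, or an empty list if x never appears as a child.
    return par.get(x, [])

def findgrand(x):
    # Grandparents of x: the parents of each of x's parents.
    # Works for any number of parents per child, not only exactly two.
    return [g for p in par[x] for g in getparent(p)]

print('grandchild\tgrandparent')
for kid in kids:
    for g in findgrand(kid):
        # Two tabs line short names up under the header at 8-column tab stops.
        print(kid + '\t\t' + g)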
Run command
python mapred3.py 192.168.43.206 /parents.txt
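The script above builds a child-to-parents dictionary and looks each parent up in it. In MapReduce terms, the same result is usually expressed as a self-join of the table on the middle generation. The following is a minimal local sketch of that idea, not a Hadoop job. The function names, the "child"/"parent" tags, and the sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit each relation twice, keyed by the person who may be the middle generation."""
    for child, parent in rows:
        yield parent, ("child", child)    # parent is the middle person; child is below
        yield child, ("parent", parent)   # child is the middle person; parent is above

def shuffle(pairs):
    """Group the tagged values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each middle person, pair every one of their children with every one of their parents."""
    for _middle, values in groups.items():
        grandchildren = [name for tag, name in values if tag == "child"]
        grandparents = [name for tag, name in values if tag == "parent"]
        for gc in grandchildren:
            for gp in grandparents:
                yield gc, gp

# Hypothetical child-parent rows.
rows = [
    ("Steven", "Lucy"), ("Steven", "Jack"),
    ("Lucy", "Mary"), ("Lucy", "Frank"),
    ("Jack", "Alice"), ("Jack", "Jesse"),
]

pairs = sorted(reduce_phase(shuffle(map_phase(rows))))
print("grandchild\tgrandparent")
for grandchild, grandparent in pairs:
    print(grandchild + "\t" + grandparent)
```

Keying both copies of each row by the shared person is what lets a single reducer call see both sides of the join, so no global dictionary is needed.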
