Python grammar

“Python is an easy to learn, powerful programming language.” Those are the first words of the official Python Tutorial. Python is widely used in data analysis, machine learning, and big-data computing (PySpark); these notes work through its syntax by reading Fluent Python.

Basic Python variable types

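Five standard data types come up most often in Python: Number, String, List, Tuple, and Dictionary. A variable needs no type declaration when it is assigned; Python infers the type from the value.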

# drink is inferred as a str, price as a float
drink = 'café'
price = 10.5
# list elements may have different types (tuple, float, str all work); access an element with city_info[index]
city_info = ['newyork', 23, 10.34, (35.689722, 139.691667)]
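
A minimal sketch of the type inference described above, reusing the same values. The `type()` checks and the unpacking of the last element are additions for illustration, not part of the original notes.

```python
drink = 'café'
price = 10.5
city_info = ['newyork', 23, 10.34, (35.689722, 139.691667)]

# type() reports the class Python inferred from each assigned value
print(type(drink).__name__)   # str
print(type(price).__name__)   # float

# a list keeps each element's own type; index from 0, or from the end with -1
element_types = [type(item).__name__ for item in city_info]
print(element_types)          # ['str', 'int', 'float', 'tuple']

# the last element is a tuple, which can be unpacked into two names
latitude, longitude = city_info[-1]
```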

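There are several ways to iterate over a list. A for clause inside a list comprehension, or lambda combined with map and filter, can both traverse the elements and filter them (here keeping only the characters whose code point is above 127, i.e. beyond ASCII):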

symbols = ['o', '0', '¢', '£', '¥', '€', '¤']
# listcomps do everything the map and filter functions do
beyond_ascii = [ord(s) for s in symbols if ord(s) > 127]

# use python lambda expression
beyond_ascii = list(filter(lambda c: c > 127, map(ord, symbols)))

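To compute the Cartesian product of two lists, use two for clauses in a list comprehension so that every element of one list is paired with every element of the other. range(value) yields the integers from 0 up to, but not including, value: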

# “Cartesian product using a list comprehension”
colors = ['black', 'white']
sizes = ['S', 'M', 'L']
tshirts = [(color, size) for color in colors for size in sizes]
# a: 0, b: 1, rest: [2, 3, 4]
a, b, *rest = range(5)
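
As a cross-check on the comprehension above, the sketch below compares it with `itertools.product` from the standard library (not mentioned in the original notes) and shows that `range(5)` stops before 5. The names `first`, `middle`, and `last` are illustrative.

```python
from itertools import product

colors = ['black', 'white']
sizes = ['S', 'M', 'L']

# nested for clauses: the leftmost clause is the outer loop
tshirts = [(color, size) for color in colors for size in sizes]

# itertools.product yields the same pairs in the same order
same_as_product = tshirts == list(product(colors, sizes))
print(same_as_product)  # True

# range(5) produces 0, 1, 2, 3, 4; the starred name collects what is left over
first, *middle, last = range(5)
print(first, middle, last)  # 0 [1, 2, 3] 4
```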

Defining functions and classes

Functions are defined with the def keyword. No return type needs to be declared, and function.__doc__ returns the function's docstring:

def factorial(n):
    """returns n!"""
    return 1 if n < 2 else n * factorial(n - 1)

# factorial(42): 1405006117752879898543142606244511569936384000000000, function doc: returns n!,
# type(factorial): <class 'function'>
print(f"factorial(42): {factorial(42)}, function doc: {factorial.__doc__}, "
        f"type(factorial): {type(factorial)}")

Classes are defined with the class keyword. __init__ plays a role similar to a constructor; inside a class definition, @classmethod marks a class method and @staticmethod marks a static method:

class Document():
  WELCOME_STR = 'Welcome! The context for this book is {}.'
  def __init__(self, title, author, context):
    print('init function called')
    self.title = title
    self.author = author
    self.__context = context
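
The class above stops before using either decorator, so here is one possible completion as a sketch. The method names `create_empty`, `welcome`, and `context_length` are hypothetical and chosen only for illustration.

```python
class Document:
    WELCOME_STR = 'Welcome! The context for this book is {}.'

    def __init__(self, title, author, context):
        self.title = title
        self.author = author
        self.__context = context  # double underscore triggers name mangling

    @classmethod
    def create_empty(cls, title, author):
        # hypothetical alternative constructor: receives the class itself as cls
        return cls(title=title, author=author, context='nothing')

    @staticmethod
    def welcome(context):
        # hypothetical helper: takes neither self nor cls
        return Document.WELCOME_STR.format(context)

    def context_length(self):
        # hypothetical instance method that reads the private attribute
        return len(self.__context)


book = Document.create_empty('Fluent Python', 'Luciano Ramalho')
print(book.title, book.context_length())  # Fluent Python 7
print(Document.welcome('indeed nothing'))
```

A class method is a natural fit for alternative constructors because it keeps working in subclasses, while a static method is just a plain function grouped under the class namespace.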

Inheritance is written as class BOWInvertedIndexEngine(SearchEngineBase): the base class goes in the parentheses after the derived class name, and __init__(self) calls the parent class constructor first:

class BOWInvertedIndexEngine(SearchEngineBase):
  def __init__(self):
    super(BOWInvertedIndexEngine, self).__init__()
    self.inverted_index = {}
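
The notes do not show `SearchEngineBase`, so the sketch below supplies a hypothetical stand-in to make the inheritance runnable. The methods `add_document` and `process` and the attribute `document_count` are assumptions for illustration only.

```python
class SearchEngineBase:
    # hypothetical stand-in for the base class, which the notes do not show
    def __init__(self):
        self.document_count = 0

    def add_document(self, text):
        self.document_count += 1
        self.process(text)

    def process(self, text):
        raise NotImplementedError('subclasses decide how to index text')


class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        # Python 3 shorthand for super(BOWInvertedIndexEngine, self).__init__()
        super().__init__()
        self.inverted_index = {}

    def process(self, text):
        # map each distinct word to the ids of the documents containing it
        for word in set(text.lower().split()):
            self.inverted_index.setdefault(word, []).append(self.document_count)


engine = BOWInvertedIndexEngine()
engine.add_document('python is easy')
engine.add_document('python is powerful')
print(engine.inverted_index['python'])  # [1, 2]
```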

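Lambda expressions combine with map and reduce to build map-reduce style code. As in other languages, anonymous functions keep this kind of code concise and readable: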

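from functools import reduce  # reduce is not a builtin in Python 3

array = [1, 2, 3, 4, 5]
map_list = list(map(lambda x: x * 2, array))  # [2, 4, 6, 8, 10]; map() alone returns a lazy iterator
reduce_value = reduce(lambda x, y: x * y, array)  # 1*2*3*4*5 = 120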

Concurrency and multithreaded data processing

Tasks are usually created with asyncio.create_task(). You can then await each task in turn, or wait for all of them at once with asyncio.gather(*tasks):

import asyncio
import time

async def metrics():
  """Measure elapsed time with time.time(); asyncio.create_task() schedules each crawl as an asynchronous task."""
  start_time = time.time()
  urls = ['url_1', 'url_2', 'url_3', 'url_4']
  # crawl_page(url) is a coroutine function defined elsewhere (not shown in these notes)
  tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
  # for task in tasks:
  #   await task
  # Alternative: asyncio.gather(*tasks) waits until every task has finished
  await asyncio.gather(*tasks)
  print(f"total used {round(time.time() - start_time, 2)} s for crawling webpage")
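
To make the pattern above runnable end to end, the sketch below adds a hypothetical `crawl_page` that sleeps instead of fetching anything, and returns the gathered results so they can be inspected. The 0.1 s delay and the returned strings are assumptions for illustration.

```python
import asyncio
import time


async def crawl_page(url):
    # hypothetical stand-in for a real crawler: sleep instead of fetching
    await asyncio.sleep(0.1)
    return f"fetched {url}"


async def metrics():
    start_time = time.perf_counter()
    urls = ['url_1', 'url_2', 'url_3', 'url_4']
    tasks = [asyncio.create_task(crawl_page(url)) for url in urls]
    # gather returns the results in the same order as the tasks passed in
    results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start_time
    return results, elapsed


results, elapsed = asyncio.run(metrics())
print(results)
# the four sleeps overlap, so the total should be well under 4 x 0.1 s
print(f"total used {elapsed:.2f} s")
```

Because the tasks are created before any of them is awaited, all four start running on the event loop together; awaiting them in a loop or through `gather` only changes how the results are collected.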

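For parallel execution with futures, when a task's return value is needed, the Future method done() reports whether the corresponding operation has finished: True if it has, False if it has not.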

import concurrent.futures

def download_all(url_sites):
  with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # solution 2: executor.map() calls download_one on each url in url_sites;
    # if max_workers is omitted, the default is derived from the CPU count
    # executor.map(download_one, url_sites)
    to_do = []
    for site in url_sites:
      future = executor.submit(download_one, site)
      to_do.append(future)

    for future in concurrent.futures.as_completed(to_do):
      # executor.submit() returns a Future; as_completed() yields each future as it finishes
      future.result()
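
A self-contained sketch of the same pattern that also collects the return values and checks `done()`. The `download_one` below is a hypothetical stand-in that sleeps and returns the length of the site name; the site names are made up for illustration.

```python
import concurrent.futures
import time


def download_one(site):
    # hypothetical stand-in for a real download: sleep, then return a size
    time.sleep(0.05)
    return len(site)


def download_all(url_sites):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # remember which site each future belongs to
        future_to_site = {executor.submit(download_one, site): site
                          for site in url_sites}
        for future in concurrent.futures.as_completed(future_to_site):
            # as_completed only yields finished futures, so done() is True here
            assert future.done()
            results[future_to_site[future]] = future.result()
    return results


sizes = download_all(['a.example', 'bb.example'])
print(sizes)
```

Keeping a dict from future to input is a common way to recover which argument produced which result, since `as_completed` yields futures in completion order rather than submission order.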

Python's multi-process components live in the multiprocessing package and are straightforward to use: create a process pool and run the tasks through pool.map().

import multiprocessing

def find_sums(numbers):
  # multiprocessing.Pool() creates a process pool; pool.map() applies cpu_bound to each item in numbers
  with multiprocessing.Pool() as pool:
    pool.map(cpu_bound, numbers)
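
The notes do not define `cpu_bound`, so the sketch below assumes a simple sum-of-squares workload and returns the mapped results. The `__main__` guard is an addition; it matters on platforms that start worker processes by re-importing the module.

```python
import multiprocessing


def cpu_bound(number):
    # hypothetical CPU-heavy task: sum of squares below number
    return sum(i * i for i in range(number))


def find_sums(numbers):
    with multiprocessing.Pool() as pool:
        # pool.map splits numbers across worker processes and keeps input order
        return pool.map(cpu_bound, numbers)


if __name__ == '__main__':
    # without this guard, spawned workers would re-run the pool creation on import
    print(find_sums([10, 100, 1000]))
```

Unlike threads, each worker here is a separate process with its own interpreter, which is why this approach suits CPU-bound work that the thread pool above would not speed up.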