请选择 进入手机版 | 继续访问电脑版

网络科技

    今日:191| 主题:284266
收藏本版
互联网、科技极客的综合动态。

[其他] Data Mining in Python: A Guide

[复制链接]
柒染划破泪浅 发表于 2016-10-5 04:45:23
499 17
Data mining and algorithms

     Data mining is t  he process of discovering predictive information from the analysis of large databases. For a data scientist, data mining can be a vague and daunting task – it requires a diverse set of skills and knowledge of many data mining techniques to take raw data and successfully get insights from it.  You’ll want to understand the foundations of statistics , and different programming languages that can help you with data mining at scale.   
    This guide will provide an example-filled introduction to data mining using Python, one of the most widely used data mining tools – from cleaning and data organization to applying machine learning algorithms. First, let’s get a better understanding of data mining and how it is accomplished.
     A data mining definition
     The desired outcome from data mining is to create a model from a given dataset that can have its insights generalized to similar datasets. A real-world example of a successful data mining application can be seen in automatic fraud detection from banks and credit institutions.
    Your bank likely has a policy to alert you if they detect any suspicious activity on your account – such as repeated ATM withdrawals or large purchases in a state outside of your registered residence. How does this relate to data mining? Data scientists created this system by applying algorithms to classify and predict whether a transaction is fraudulent by comparing it against a historical pattern of fraudulent and non-fraudulent charges. The model “knows” that if you live in San Diego, California, it’s highly likely that the thousand dollar purchases charged to a scarcely populated Russian province were not legitimate.
    That is just one of a number of the powerful applications of data mining. Other applications of data mining include genomic sequencing, social network analysis, or crime imaging – but the most common use case is for analyzing aspects of the consumer life cycle. Companies use data mining to discover consumer preferences, classify different consumers based on their purchasing activity, and determine what makes for a well-paying customer – information that can have profound effects on improving revenue streams and cutting costs.
     If you’re struggling to get good datasets to to begin your analysis, we’ve compiled 19 free datasets for your first data science project .
     What are some data mining techniques?
    There are multiple ways to build predictive models from datasets, and a data scientist should understand the concepts behind these techniques, as well as how to use code to produce similar models and visualizations. These techniques include:
   
         
  •   Regression  – Estimating the relationships between variables by optimizing the reduction of error.   
     

Data Mining in Python: A Guide

Data Mining in Python: A Guide-1-网络科技-understand,techniques,different,databases,knowledge

    An example of a scatterplot with a fitted linear regression model.
   
         
  •   Classification  – Identifying what category an object belongs to. An example is classifying email as spam or legitimate, or looking at a person’s credit score and approving or denying a loan request.     
  •   Cluster Analysis  – Finding natural groupings of data objects based upon the known characteristics of that data. An example could be seen in marketing, where analysis can reveal customer groupings with unique behavior – which could be applied in business strategy decisions.   
     

Data Mining in Python: A Guide

Data Mining in Python: A Guide-2-网络科技-understand,techniques,different,databases,knowledge

    An example of a scatter plot with the data segmented and colored by cluster.
   
         
  •   Association and Correlation Analysis  – Looking to see if there are unique relationships between variables that are not immediately obvious. An example would be the famous case of beer and diapers: men who bought diapers at the end of the week were much more likely to buy beer, so stores placed them close to each other to increase sales.     
  •   Outlier analysis  – Examining outliers to examine potential causes and reasons for said outliers. An example of which is the use of outlier analysis in fraud detection, and trying to determine if a pattern of behavior outside the norm is fraud or not.     
     Data mining for business is often performed with a transactional and live database that allows easy use of data mining tools for analysis. One example of which would be an  On-Line Analytical Processing server , or OLAP, which allows users to produce multi-dimensional analysis within the data server. OLAPs allow for business to query and analyze data without having to download static data files, which is helpful in situations where your database is growing on a daily basis.  However, for someone looking to learn data mining and practicing on their own, an iPython notebook  will be perfectly suited to handle most data mining tasks.  
     Let’s walk through how to use Python to perform data mining using two of the data mining algorithms described above: regression and  clustering.
    Creating a regression model in Python

    What is the problem we want to solve?

     We want to create an estimate of the linear relationship between variables, print the coefficients of correlation, and plot a line of best fit. For this analysis, I’ll be using data from the  House Sales in King’s County data set   from Kaggle. If you’re unfamiliar with Kaggle , it’s a fantastic resource for finding data sets good for practicing data science. The King’s County data has information on house prices and house characteristics – so let’s see if we can estimate the relationship between house price and the square footage of the house.  
    First step: Have the right data mining tools for the job – install Jupyter, and get familiar with a few modules.
          First things first, if you want to follow along, install Jupyter on your desktop. It’s a free platform that provides what is essentially a processer for iPython notebooks (.ipynb files) that is extremely intuitive to use.  Follow these instructions for installation  . Everything I do here will be completed in a “Python [Root]” file in Jupyter.
     We will be using the      Pandas      mo     dule of Python to clean and restructure our data. Pandas is an open-source module for working with data structures and analysis, one that is ubiquitous for data scientists who use Python. It allows for a data scientists to upload data in any format, and provides a simple platform organize, sort, and manipulate that data. If this is your first time using Pandas, check out      this awesome tutorial on the basic functions      !
        In [1]:
   
  1. import pandas as pd
  2. import matplotlib.pyplot as plt
  3. import numpy as np
  4. import scipy.stats as stats
  5. import seaborn as sns
  6. from matplotlib import rcParams

  7. %matplotlib inline
  8. %pylab inline
复制代码
   
  1. Populating the interactive namespace from numpy and matplotlib
复制代码
               In the code above I imported a few modules, here’s a breakdown of what they do:
     
          
  •   Numpy  – a necessary package for scientific computation. It includes an incredibly versatile structure for working with arrays, which are the primary data format that scikit-learn uses for input data.      
  •   Matplotlib  – the fundamental package for data visualization in Python. This module allows for the creation of everything from simple scatter plots to 3-dimensional contour plots. Note that from matplotlib we install pyplot, which is the highest order state-machine environment in the modules hierarchy (if that is meaningless to you don’t worry about it, just make sure you get it imported to your notebook). Using ‘%matplotlib inline’ is essential to make sure that all plots show up in your notebook.      
  •   Scipy  – a collection of tools for statistics in python. Stats is the scipy module that imports regression analysis functions.     
     Let’s break down how to apply data mining to solve a regression problem step-by-step! In real life you most likely won’t be handed a dataset ready to have machine learning techniques applied right away, so you will need to clean and organize the data first.
        In [2]:
   
  1. df = pd.read_csv('/Users/michaelrundell/Desktop/kc_house_data.csv')
  2. df.head()
复制代码
         Out[2]:
                                      id        date        price        bedrooms        bathrooms        sqft_living        sqft_lot                                  0        7129300520        20141013T000000        221900.0        3        1.00        1180        5650                      1        6414100192        20141209T000000        538000.0        3        2.25        2570        7242                      2        5631500400        20150225T000000        180000.0        2        1.00        770        10000                      3        2487200875        20141209T000000        604000.0        4        3.00        1960        5000                      4        1954400510        20150218T000000        510000.0        3        2.00        1680        8080                       Reading the csv file from Kaggle using pandas (pd.read_csv).
                  In [3]:
   
  1. df.isnull().any()
复制代码
         Out[3]:
   
  1. id               False
  2. date             False
  3. price            False
  4. bedrooms         False
  5. bathrooms        False
  6. sqft_living      False
  7. sqft_lot         False
  8. ...
  9. dtype: bool
复制代码
             Checking to see if any of our data has null values. If there were any, we’d drop or filter the null values out.
    In [4]:
   
  1. df.dtypes
复制代码
         Out[4]:
   
  1. id                 int64
  2. date              object
  3. price            float64
  4. bedrooms           int64
  5. bathrooms        float64
  6. sqft_living        int64
  7. sqft_lot           int64
  8. ...
  9. dtype: object
复制代码
       Checking out the data types for each of our variables. We want to get a sense of whether or not data is numerical (int64, float64) or not (object).  
     I imported the data frame from the csv file using Pandas, and the first thing I did was make sure it reads properly. I also used the “isnull()” function to make sure that none of my data is unusable for regression. In real life, a single column may have data in the form of integers, strings, or NaN, all in one place – meaning that you need to check to make sure the types are matching and are suitable for regression. This dataset happens to have been very rigorously prepared, something you won’t see often in your own database.
   Next: Simple exploratory analysis and regression results.

   Let’s get an understanding of the data before we go any further, it’s important to look at the shape of the data – and to double check if the data is reasonable. Corrupted data is not uncommon so it’s good practice to always run two checks: first, use df.describe() to look at all the variables in your analysis. Second, plot histograms of the variables that the analysis is targeting using plt.pyplot.hist().
           In [5]:
   
  1. df.describe()
复制代码
         Out[5]:
                             price       bedrooms       bathrooms       sqft_living                             count       21613       21613       21613       21613                   mean       540088.10       3.37       2.11       2079.90                   std       367127.20       0.93       0.77       918.44                   min       75000.00       0.00       0.00       290.00                   25%       321950.00       3.00       1.75       1427.00                   50%       450000.00       3.00       2.25       1910.00                   75%       645000.00       4.00       2.50       2550.00                   max       7700000.00       33.00       8.00       13540.00                               Quick takeaways: We are working with a dataset that contains 21,613 observations, mean price is approximately $540k, median price is approximately $450k, and the average house’s area is 2080 ft 2
    In [19]:
   
  1. fig = plt.figure(figsize=(12, 6))
  2. sqft = fig.add_subplot(121)
  3. cost = fig.add_subplot(122)

  4. sqft.hist(df.sqft_living, bins=80)
  5. sqft.set_xlabel('Ft^2')
  6. sqft.set_title("Histogram of House Square Footage")

  7. cost.hist(df.price, bins=80)
  8. cost.set_xlabel('Price ($)')
  9. cost.set_title("Histogram of Housing Prices")

  10. plt.show()
复制代码
      

Data Mining in Python: A Guide

Data Mining in Python: A Guide-3-网络科技-understand,techniques,different,databases,knowledge

             Using matplotlib (plt) we printed two histograms to observe the distribution of housing prices and square footage. What we find is that both variables have a distribution that is right-skewed.
     Now that we have a good sense of our data set and know the distributions of the variables we are trying to measure, let’s do some regression analysis. First we import statsmodels to get the least squares regression estimator function. The “Ordinary Least Squares” module will be doing the bulk of the work when it comes to crunching numbers for regression in Python.
               In [15]:
     
  1. import statsmodels.api as sm
  2. from statsmodels.formula.api import ols
复制代码
                      When you code to produce a linear regression summary with OLS with only two variables this will be the formula that you use:
       Reg = ols(‘Dependent variable ~ independent variable(s), dataframe).fit()
       print(Reg.summary())
       When we look at housing prices and square footage for houses in King’s county, we print out the following summary report:
       In [16]:
      
  1. Populating the interactive namespace from numpy and matplotlib0
复制代码
         
  1. Populating the interactive namespace from numpy and matplotlib1
复制代码
                      An example of simple linear regression model summary output.
        When you print the summary of the OLS regression, all relevant information can be easily found, including R-squared, t-statistics, standard error, and the coefficients of correlation. Looking at the output, it’s clear that there is an extremely significant relationship between square footage and housing prices since there is an extremely high t-value of 144.920, and a  P>|t| of 0%–which essentially means that this relationship has a near-zero chance of being due to statistical variation or chance.
       This relationship also has a decent magnitude – for every additional 100 square-feet a house has, we can predict that house to be priced $28,000 dollars higher on average. It is easy to adjust this formula to include more than one independent variable, simply follow the formula:
       Reg = ols(‘Dependent variable ~ivar1 + ivar2 + ivar3… + ivarN, dataframe).fit()
       print(Reg.summary())
       In [26]:
      
  1. Populating the interactive namespace from numpy and matplotlib2
复制代码
         
  1. Populating the interactive namespace from numpy and matplotlib3
复制代码
                 An example of multivariate linear regression.
     In our multivariate regression output above, we learn that by using additional independent variables, such as the number of bedrooms, we can provide a model that fits the data better, as the R-squared for this regression has increased to 0.555. This means that we went from being able to explain about 49.3% of the variation in the model to 55.5% with the addition of a few more independent variables.
     Visualizing the regression results.

     Having the regression summary output is important for checking the accuracy of the regression model and data to be used for estimation and prediction – but visualizing the regression is an important step to take to communicate the results of the regression in a more digestible format.
     This section will rely entirely on Seaborn (sns), which has an incredibly simple and intuitive function for graphing regression lines with scatterplots. I chose to create a jointplot for square footage and price that shows the regression line as well as distribution plots for each variable.
        In [24]:
   
  1. Populating the interactive namespace from numpy and matplotlib4
复制代码
        
  1. Populating the interactive namespace from numpy and matplotlib5
复制代码
   

Data Mining in Python: A Guide

Data Mining in Python: A Guide-4-网络科技-understand,techniques,different,databases,knowledge

              That wraps up my regression example, but there are many other ways to perform regression analysis in python, especially when it comes to using certain techniques. For more on regression models, consult the resources below. Next we’ll cover cluster analysis.
   
         
  •               Visualizing linear relationships using Seaborn        – this documentation gives specific examples that show how to modify you regression plots, and display new features that you might not know how to code yourself. It also teaches you how to fit different kinds of models, such as quadratic or logistic models.
            Statistics in Python  – this tutorial covers different techniques for performing regression in python, and also will teach you how to do hypothesis testing and testing for interactions.
       
       If you want to learn about more data mining software that helps you with visualizing your results, you should look at these 31 free data visualization tools we’ve compiled.
       Creating a Clustering Model in Python

   What is the problem we want to solve?

    We want to create natural groupings for a set of data objects that might not be explicitly stated in the data itself. Our analysis will use data on the eruptions from Old Faithful, the famous geyser in Yellowstone Park. The data is found from  this Github repository  by Barney Govan. It contains only two attributes, waiting time between eruptions (minutes) and length of eruption (minutes). Having only two attributes makes it easy to create a simple k-means cluster model.
   What is a k-means cluster model?

   K-Means Cluster models work in the following way – all credit to this blog:
   
       
  • Start with a randomly selected set of k centroids (the supposed centers of the k clusters)   
  • Determine which observation is in which cluster, based on which centroid it is closest to (using the squared Euclidean distance: ∑pj=1(xij−xi′j)2 where p is the number of dimensions.   
  • Recalculate the centroids of each cluster by minimizing the squared Euclidean distance to each observation in the cluster   
  • Repeat 2. and 3. until the members of the clusters (and hence the positions of the centroids) no longer change.   
    If this is still confusing, check out  this helpful video  by Jigsaw Academy. For now, let’s move on to applying this technique to our Old Faithful data set.
   Step One: Exploratory Data Analysis

    You will need to install a few modules, including one new module called  Sci-kit Learn  – a collection of tools for machine learning and data mining in Python. Cluster is the sci-kit module that imports functions with clustering algorithms, hence why it is imported from sci-kit.
    First, let’s import all necessary modules into our iPython Notebook and do some exploratory data analysis .
            In [18]:
   
  1. Populating the interactive namespace from numpy and matplotlib6
复制代码
            Out[18]:
                                  eruptions        waiting                                  0        3.600        79                      1        1.800        54                      2        3.333        74                      3        2.283        62                      4        4.533        85                              Reading the old faithful csv and importing all necessary values
   All I’ve done is read the csv from my local directory, which happens to be my computer’s desktop, and shown the first 5 entries of the data. Fortunately I know this dataset has no columns with missing or NaN values, so we can skip the data cleaning section in this example. Let’s take a look at a basic scatterplot of the data.
            In [19]:
   
  1. Populating the interactive namespace from numpy and matplotlib7
复制代码
                  Out[19]:
     
  1. Populating the interactive namespace from numpy and matplotlib8
复制代码
               

Data Mining in Python: A Guide

Data Mining in Python: A Guide-5-网络科技-understand,techniques,different,databases,knowledge

             Renaming the columns and using matplotlib to create a simple scatterplot.
       Some quick notes on my process here: I renamed the columns – they don’t look any different to the naked eye, but the “waiting” column had an extra space before the word, and to prevent any confusion with further analysis I changed it to ensure I don’t forget or make any mistakes down the road.  
       Step two: Building the cluster model

       What we see is a scatter plot that has two clusters that are easily apparent, but the data set does not label any observation as belonging to either group. The next few steps will cover the process of visually differentiating the two groups. In the code below, I establish some important variables and alter the format of the data.
                   In [20]:
              
  1. Populating the interactive namespace from numpy and matplotlib9
复制代码
       Formatting and function creation.
        
             
  • I read the faithful dataframe as a numpy array in order for sci-kit to be able to read the data.         
  • K = 2 was chosen as the number of clusters because there are 2 clear groupings we are trying to create.         
  • The ‘kmeans’ variable is defined by the output called from the cluster module in sci-kit. We have it take on a K number of clusters, and fit the data in the array ‘faith’.        
        Now that we have set up the variables for creating a cluster model, let’s create a visualization. The code below will plot a scatter plot that colors by cluster, and gives final centroid locations. Explanation of specific lines of code can be found below.
                           In [21]:
         
  1. df = pd.read_csv('/Users/michaelrundell/Desktop/kc_house_data.csv')
  2. df.head()0
复制代码
                           

Data Mining in Python: A Guide

Data Mining in Python: A Guide-6-网络科技-understand,techniques,different,databases,knowledge

                     Creating a visualization of the cluster model.
           A quick breakdown of the code above:
           
                
  • All of the work done to group the data into 2 groups was done in the previous section of code where we used the command kmeans.fit(faith). This section of the code simply creates the plot that shows it.            
  • The ds variable is simply the original data, but reformatted to include the new color labels based on the number of groups – the number of integers in k.            
  • plt.plot calls the x-data, the y-data, the shape of the objects, and the size of the circles.            
  • The rest of the code displays the final centroids of the k-means clustering process, and controls the size and thickness of the centroid markers.           
           And here we have it – a simple cluster model. This code be adapted to include a different number of clusters, but for this problem it makes sense to include only 2 clusters. Now that we have these clusters that seem to be well defined, we can infer meaning from these two clusters. What do they stand for? The green cluster: consisting of mostly short eruptions with a brief waiting time between eruptions could be defined as ‘weak or rapid-fire’, while the blue cluster could be called ‘power’ eruptions.
            Of note: this technique is not adaptable for all datasets –  data scientist David Robinson  explains it perfectly in his article  that K-means clustering is “not a free lunch.” K-means has assumptions that fail if your data has uneven cluster probabilities (they don’t have approximately the same amount of observations in each cluster), or has non-spherical clusters. If you don’t think that your clustering problem will work well with K-means clustering, check out these resources on alternative cluster modeling techniques:
           
                
  •   Sci-kit Clustering Modules   this documentation has a nifty image that visually compares of the clustering algorithms in scikit-learn, as they look for different scatterplots. Using this documentation can point you to the right algorithm to use if you have a scatter plot similar to one of their examples. It also gives you some insight on how to evaluate your clustering model mathematically.            
  •   Clustering Algorithms  – this Powerpoint presentation from Stanford’s CS345 course, Data Mining, gives insight into different techniques – how they work, where they are effective and ineffective, etc. It is a great learning resource to understand how clustering works at a theoretical level.           
           Conclusion

           Data mining encompasses a number of predictive modeling techniques and you can use a variety of data mining software. To learn to apply these techniques using Python is difficult – it will take practice and diligence to apply these on your own data set. Early on you will run into innumerable bugs, error messages, and roadblocks. – but stay persistent and diligent in your data mining attempts. I hope that through looking at the code and creation process of the cluster and linear regression models above, you have learned that data mining is achievable, and can be finished with an efficient amount of code.
                                                         

Data Mining in Python: A Guide

Data Mining in Python: A Guide-7-网络科技-understand,techniques,different,databases,knowledge
微笑啊哦 发表于 2016-10-5 06:24:22
路边的野花不要,踩。  
回复 支持 反对

使用道具 举报

寒雁 发表于 2016-10-5 06:30:20
你身材很好,好的连孙悟空看见你,都会给你三棍子了。
回复 支持 反对

使用道具 举报

莹常华泽 发表于 2016-10-6 03:25:36
香烟不乖,所以我们抽烟…
回复 支持 反对

使用道具 举报

拥抱你1 发表于 2016-10-6 03:25:36
不要和地球人一般见识.
回复 支持 反对

使用道具 举报

蹲街角只为等待伱 发表于 2016-10-6 03:29:48
今年楼主不送礼,送礼就给楼下的!
回复 支持 反对

使用道具 举报

雷汉林 发表于 2016-10-6 03:29:51
如果你真的爱他,那么你必须容忍他部份的缺点。
回复 支持 反对

使用道具 举报

liluo1991 发表于 2016-10-6 03:29:56
最近压力山大啊!
回复 支持 反对

使用道具 举报

怜雪 发表于 2016-10-6 03:30:21
有十只羊,九只蹲在羊圈,一只蹲在猪圈.打一成语…
回复 支持 反对

使用道具 举报

迎蓉 发表于 2016-10-6 03:48:48
你长的比假奶粉的毒性都大,我看了就头大。
回复 支持 反对

使用道具 举报

我要投稿

推荐阅读


回页顶回复上一篇下一篇回列表
手机版/c.CoLaBug.com ( 粤ICP备05003221号 | 文网文[2010]257号 | 粤公网安备 44010402000842号 )

© 2001-2017 Comsenz Inc.

返回顶部 返回列表