The Importance of the Basics: Data Management
Every day we interact with data, whether through everyday decision-making or at the level of business operations. Much of this data is unstructured, and our goal as business and data analysts is to turn it into information and knowledge. Sometimes analysts lose sight of this goal and focus too much on building complex databases and data warehouses without paying attention to the quality and applicability of the data. Data management is the foundation for everything else that has to do with analyzing data: if our first steps are thoughtless, the whole structure can crumble. In data management, we gather, process, and examine our data based on the end goal we established.
1. Information extraction – Our first step is information extraction, where we use different tools to retrieve the specific information we want to analyze from websites, documents, and other sources. The data we extract needs to help answer the question we set at the beginning. There are different ways to extract information, a popular one being web scraping.
Data scraping is a process performed by computer programs that extracts readable output from other sources, such as websites in the case of web scraping. There are many web scraping tools to choose from. I've tested the webscraper.io developer tool, and it has proven useful when scraping an extensive website, for example Zillow or Indeed.com.
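webscraper.io is a point-and-click browser tool, but the same idea can be expressed in a few lines of code. Below is a minimal sketch using Python's requests and BeautifulSoup libraries; the URL and CSS selectors are hypothetical placeholders, and large sites like Zillow or Indeed restrict automated access, so treat this as an illustration of the technique rather than a ready-made scraper.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical listings page; use a site whose terms permit scraping.
    URL = "https://example.com/listings"

    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors

    soup = BeautifulSoup(response.text, "html.parser")

    # The CSS selectors below are placeholders; inspect the real page in
    # the browser's developer tools to find the actual ones.
    for card in soup.select("div.listing-card"):
        title = card.select_one("h2.title")
        price = card.select_one("span.price")
        if title and price:
            print(title.get_text(strip=True), "|", price.get_text(strip=True))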
2. Data processing – After we extract and collect our data, we have to adjust its format and organization to make it easier to read and to enable the downstream processes that will give us the knowledge we're after. Data processing looks different depending on the data and our goal. A good place to start is looking at the structure of the data, including the data types, structural classification, and general observations such as the number of data points (observations) and the number of factors (columns). Part of data processing is performing data cleaning tasks, which can be divided into three parts: parsing, correction, and standardization, each illustrated in the sketches after the list below.
- Parsing: identifying and separating data elements from unstructured text. This involves looking for patterns and specifications that belong to the factors in our data. Regular expressions become useful here by helping us identify these patterns and match them to our set factors.
- Correction: addressing values that may cause errors when we process the data. This includes filling in missing values, checking for conflicting values, and fixing the data types of our columns.
- Standardization: creating preferred formats and applying them to our data. Some common examples are date formats, addresses (including state notation) and commonly used institution names (including abbreviations).
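As a rough illustration of parsing, here is a minimal sketch that uses Python's re module to pull structured fields out of a line of free text. The sample line and the pattern are invented for the example.

    import re

    # Hypothetical unstructured line containing a seller, a date, and a price.
    raw = "Sold by ACME Corp. on 2024-03-15 for $1,250.00"

    # Named groups make the extracted factors easy to read back out.
    pattern = re.compile(
        r"Sold by (?P<seller>.+?) on (?P<date>\d{4}-\d{2}-\d{2}) for \$(?P<price>[\d,.]+)"
    )

    match = pattern.search(raw)
    if match:
        print(match.groupdict())
        # {'seller': 'ACME Corp.', 'date': '2024-03-15', 'price': '1,250.00'}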
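Correction and standardization are often done together once the data sits in a table. Here is a minimal sketch using pandas, with invented column names and values (the format="mixed" option requires pandas 2.0 or later):

    import pandas as pd

    # Invented sample with typical problems: missing values, mixed date
    # formats, inconsistent state notation, and numbers stored as text.
    df = pd.DataFrame({
        "state": ["GA", "Georgia", "ga", None],
        "signup_date": ["2024-01-05", "01/06/2024", "2024/01/07", "2024-01-08"],
        "amount": ["100", "250", None, "75"],
    })

    # Correction: fix the data type and fill missing values.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0)

    # Standardization: one preferred notation for states, one date format.
    state_map = {"georgia": "GA", "ga": "GA"}
    df["state"] = df["state"].str.lower().map(state_map).fillna(df["state"])
    df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed")

    print(df)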
3. Data profiling – Our last step in data management is getting knowledge out of our now collected and cleaned data. Data profiling takes the data and prepares summaries about it. Among the summary outputs from data profiling are frequency distributions for each variable, counts and categories of missing values, data type checks, and data integrity assessments. With data profiling we can assess our data quality and decide whether any changes need to be made in the earlier steps of the process.
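Much of this profiling comes down to a few pandas calls. A minimal sketch, again with an invented DataFrame standing in for the cleaned data:

    import pandas as pd

    # Invented sample; in practice this is the output of the processing step.
    df = pd.DataFrame({
        "state": ["GA", "GA", "NC", None],
        "amount": [100.0, 250.0, 0.0, 75.0],
    })

    print(df.dtypes)                    # are the data types what we expect?
    print(df.isna().sum())              # missing values per column
    print(df["state"].value_counts())   # frequency distribution of a variable
    print(df.describe(include="all"))   # quick summary statistics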