Courses/CS 491ab/Winter 2008/Marcella Wong

From CSWiki

Jump to: navigation, search

User:Marcella

Contents

[edit] Week 1 - January 4, 2008

Introduction

[edit] Week 2 - January 11, 2008

Business Intelligence
BI sits on top of an existing database application. BI basically takes existing data, organizes it so that it can be presented to help business owners make informed decisions. The BI System integrates databases from different business processes into one centralized system. From there, users can create customized reports to view revenue earned, business predictions and trends, global performance, etc.

BI System is used for data mining, data manipulation and data presentation. BI essentially takes existing data from, say a data warehouse, and aggregates, manipulates and organize the data so that it can allow for reports and analytics.

For example, if you want to figure out how many of each item to stock, you'd want to look for customer purchasing habits and patterns. So when you scan grocery items, all customer purchase info will go into a database. The BI application would take that data from your database, manipulate it, aggregate it and present it in a way such that the manager will be able to make an informed decision such as charts and diagrams.

So, what Pentaho is just the BI System that needs to "plug-in" to an existing business application. An existing business application helps you run business processes like ordering spare parts, setting payroll, hiring new employees, giving employee benefits, handling vendor information. There is a database underneath it and you add/remove information in the database.

Platform: Pentaho

In order to make sense of the data in BI, you would need additional tools to help business owners see and analyze the data. Pentaho offers 6 business applications:

  • Reporting
    • Display charts, tables, etc. to look at the business in different areas.
    • Ex. Status reports displaying revenue earned each year for the last 5 years.
    • Report types include Microsoft Excel, HTML, pdf, etc.


  • Analysis
    • Break down status reports to get more details.
    • Ex. Sales dropped overall in the last year. Was the overall sales revenue drop pulled down by a specific region? If yes, look into that region to see what is going on.
    • Allow for analysis on different business processes at once?


  • Dashboards
    • Shows key metrics to improve organizational performance.
    • Ex. Are we meeting targets for profitability?
    • Shows different desired reports together to compare.
    • Dashboards


  • Data Mining Engine
    • A set of machine learning algorithms that data can be ran through for analysis to find correlations to improve business.
    • Ex. Based on previous data, this customer will pay late.


  • Data Integration
    • Integrates different sets of data to one consistent version across all the sources.
    • Uses Extractions, Transformations and Loading capabilities
    • Ex. Importing data in database from text files and excel sheets.


  • BI Platform
    • Business Intelligence Platform that supports the above capabilities.
    • Supports the 5 tools listed above.

[edit] Week 3 - January 18, 2008

[edit] Project (Tentative):

Be able to let users create all types of reports that give statistics on Computer Science students.

Types of reports:

  • Ad hoc (allow users to create specific, customized queries)
    • Which elective courses had the most students for the past 3 years?
    • Number of students enrolled in certain classes
    • Ex. Did certain classes get harder over the years?
  • Default Reports
    • Professor - Student ratio
    • Which courses are getting maxed out
    • Which courses are not getting enough sign-up rates?
    • Popular/Unpopular elective courses
    • Statistics on Grades
      • Ex. Average grades throughout all CS classes
    • Students who are receiving grants/scholarships

Main Users:

  • Administration
  • Faculty Members
  • Professors


[edit] Getting Started with Pentaho

[edit] Start Pentaho BI Server

Image:start.jpg

[edit] Logging In

Image:login.jpg

[edit] Main Page

Image:main.jpg

[edit] Go -> Solutions

  • Collection of folders and documents in Pentaho. Contains all your Business Solutions - types of reports, etc.

Image:solutions.jpg

[edit] Reporting Example

  • Typical reporting/statistics pie chart

Image:pie_chart_example.jpg

[edit] Analysis Example

  • Starts off with main headers, which can be collapsed to view more details

Image:analysis_example.jpg

[edit] Dashboard Example

  • Dashboard - putting multiple reports/graphs together to compare
  • Interactive
  • Embed onto webpages
  • Samples > Dashboards > JSP Dashboard

Image:dashboard_example1.jpg
Image:dashboard_example2.jpg

[edit] Go -> My Workspace

  • Allow users to run reports in the background

Image:workspace_waiting.jpg
Image:workspace_complete.jpg

  • Allow users to select certain reports and schedule delivery dates (via e-mail), delivery format, etc.

Image:subscriptions.jpg

Subscriptions screenshot courtesy of Pentaho.com

Note: Talked to Dr. Pamula about GET system's database architecture. Result: GET system's database is really messy. It's better to create my own database, but talk to him next week.

[edit] Week 4 - January 25, 2008

Talked to Dr. Pamula about CSNS. Dr. Sun is actually in charge of the CSNS database. Since he is away for this quarter, I will have to e-mail him about it. Hopefully he'll be able to help.

[edit] Building a Pentaho Report

After downloading and starting Pentaho Report Designer, you should see this:

Image:reportdesigner.jpg

[edit] Palette Panel

  • Shows the different types of objects that are used to create the report.
    • Label: Anything. Properties Panel will allow you to adjust font, size, etc.
    • Text Field: Must match database attribute names. It pulls up the values of that attribute from database.
    • Number Field: Same as Text Field, but for numbers. Allow you to format numbers to certain styles.
    • Rectangle, Ellipse, Line: Add rectangles, ellipses and lines to report. Properties panel will handle formatting.


Image:palette.jpg

[edit] Structure

  • The text version of the layout of your report page


Image:structure.jpg

[edit] Adding to Data Set

Data Set: A set of tables in your database.

  • Connection Information
    • Specifies the connection between Report Designer and your database
  • Query Details
    • Allows you to type in your own SQL code to run queries.
    • SQLeonardo: Similar to Microsoft Access. Allows you to create queries without using SQL.


Image:dataset.jpg

Image:sqleonardo.jpg

  • Allows you to view the attributes and values in tables


Image:rawdata.jpg

  • When you want to know the attribute type (String, integer, etc.), click on it in the Structure panel and it will show up in the Properties panel.


Image:report_properties.jpg

[edit] Finished Building Report

  • Drag and drop objects onto the panels to create your report. Finished report in Designer can look something like this:


Image:report_design_view.jpg

  • The Structure panel will have a list of all the Objects being used to create the report.
  • Properties panel will display the appropriate properties for different objects when they are highlighted in the Structure panel.


Image:report_structure.jpg

  • An example of a completed report can look like this:


Image:report_complete.jpg

[edit] Report Types

  • There are many different ways to display your report.
    • PDF
    • HTML
    • CSV
    • etc.


Image:report_menu.jpg

[edit] Week 5 - February 1, 2008

[edit] Getting Started: Creating Solutions

[edit] What are Solutions?

To increase business profits, companies look at charts and reports based on historical data to make smart business decisions. From those charts/reports, decisions need to be made and a process towards that decision needs to be started. This entire process - identifying the problem/situation, making decisions, and starting the process to fixing the problem - is the Solution.

[edit] Workflows

A chain of events that execute a Solution.

Taken from Pentaho.com:

 For example: run a query to find out which departments are over budget, 
 run a budget report for each of those departments, and finally email each 
 report to the department manager.

Example for my project:

 Run a report of all the clubs on campus and their budgets. Run a list of 
 reports alongside it to see which clubs are requesting money. Based on that, 
 allocate appropriate amounts for each club and email the decision to the 
 Finance Department.

Pentaho provides an engine that allows for easy creation of workflows. The workflow engine uses XML Process Definition Language (XPDL) to execute workflows. XPDL is a standardized XML format for modeling tools and workflow engines. It helps with the interchanging of business process definitions.

[edit] Case Study

Taken from Pentaho.com:

Problem: An employee for a health care organization is working with an expired license.

Possible Outcomes:

  • Employer must find a replacement until employee's license is renewed.
  • Employee continues working, unnoticed by employer that he is working with expired license. Here, patient safety hazard and a liability risk occurs.

Business Goal: Reduce money spent to cover unlicensed employees until they renew their licenses. Increase patient safety. Reduce liability risks.

Current Process: Each manager maintains a list of employees with their license expiration.

Proposed Solution: Have a centralized database with all the employees, their departments, current licenses and expiration on those licenses. Every so often, send out reports to each department manager notifying their nearly-expired licenses.

Solution 1: Distribute a list of employees with their expired license dates once a month to department managers.

  • Managers must take time to read and look through the report.
    • Dates can get misread
  • Report doesn't get delivered.
  • Employees do not allow enough time for license renewal.

This solution is insufficient. The report gets delivered, but chances of reaching the business goal is not high.

Solution 2: Create workflows/rules that determine the amount of time the employees has to get his/her license renewal on time.

  • Send out reports to employee and managers each week with warnings that indicate the nearing license expiration.
  • Require acknowledgment from both manager and employees
  • Deliver failure-to-respond type notifications

Solution 2 offers a more sufficient and efficient solution. It not only delivers the report, but also makes sure that the managers and employees read the notices and acknowledges what needs to be done.

[edit] Action Sequence

All examples listed below are provided by Pentaho.com

An Action Sequence is an XML document that lists tasks that the solution engine can perform. The Solution Engine that Pentaho has is where the Action Sequences are read and executed.

[edit] Defining Input Variables

 <inputs>
     <region type="string">
        <default-value>Central</default-value>
        <sources>
           <request>REGION</request>
           <session>aRegion</session>
        </sources>
     </region>
     <from type="string">
        <default-value>joepentaho@pentaho.org</default-value>
     </from>
     <subject type="string">
        <default-value>Pentaho Example1</default-value>
     </subject>
     <message-plain type="string">
        <default-value>
                This is an email from the Pentaho BI Platform - Example1
        </default-value>
     </message-plain>
  </inputs>

There are 4 inputs in this Action Sequence. They are enclosed in the <input> tags:

  • region
    • Has type string
    • Has default value of Central
  • from
    • Has type string
    • Has default value of joepentaho@pentaho.org
  • subject
    • Has type string
    • Has default value of Pentaho Example1
  • message-plain
    • Has type string
    • Has default value of This is an email from the Pentaho BI Platform - Example1

[edit] Defining Actions - With JavaScript

Action Definitions define what actions that the Solution Engine must perform. In XML, events to be executed are enclosed in <action-definition> tags, which are enclosed in <actions> tag.

 <action-definition>
        <action-inputs>
           <region type="string"/>
        </action-inputs>
        <action-outputs>
           <rule_result type="string"/>
        </action-outputs>
        <component-name>JavascriptRule</component-name>
        <component-definition>
           <script>
              <![CDATA[
                 if ( "Central".equals( region ) ) {
                    rule_result = "joe@pentaho.org";
                 }
                 else {
                    rule_result = "suzy@pentaho.org";
                 }
             ]]>
           </script>
        </component-definition>
 </action-definition>

There are two variables that are being used in this action definition - region and rule_result. The region variable comes from the input that is given by the user earlier. The rule_result variable is newly created. It doesn't have a default value and does not accept any inputs. A value will be assigned to it.

Notice how there is a JavaScript Component. The lines written between the <component> tags are JavaScript. If the region variable has a value of "Central", then it will assign joe@pentaho.org to rule_result. If not, it will assign suzy@pentaho.org to rule_result.

[edit] Variable Creation and Assignment

 <action-definition>
        <action-inputs>
           <to type="string" mapping="rule_result"/>
           <from type="string"/>
           <subject type="string"/>
           <message-plain type="string"/>
        </action-inputs>
        <component-name>EmailComponent</component-name>
        <component-definition/>
 </action-definition>

In this segment, the from, subject, and message-plain have values that are assigned by the user input. The to variable is newly created. The component specified here is "EmailComponent", which means this entire action sequence is an email.

 <to type="string" mapping="rule_result"/>

The line above is analogous to: to = rule_result. You are assigning the value received from the if-statement (rule_result) and putting it into the variable to.

[edit] Data Types
  • content - Content is large amount of data that is generated within a component. An example of content is a PDF file that is generated from the reporting component.
 <attachment type="content">
     <sources>
         <runtime>report-output</runtime>
     </sources>
 </attachment>


  • long - A Java Long Object.
 <amount type="long">
     <default-value>25</default-value>
 </amount>
  • property-map - A property map of Java Strings

The property-map name is "veggie-data". Here, name = carrot, color = orange, etc. In database, the key usually maps to column names and the values usually map to that column's data.

 <veggie-data type="property-map ">
   <default-value type="property-map">
        <property-map>
           <entry key="name">carrot</entry>
           <entry key="color">orange</entry>
           <entry key="shape">cone</entry>
           <entry key="texture">bumpy</entry>
       </property-map>
   </default-value>
 </veggie-data>
  • property-map-list - A list of property maps of Java Strings.

There are several two property-maps in the example below.

 <fruit-data type="property-map-list">
   <default-value type="property-map-list">
       <property-map>
           <entry key="name">orange</entry>
           <entry key="color">orange</entry>
           <entry key="shape">sphere</entry>
           <entry key="texture">dimply</entry>
       </property-map>
       <property-map>
           <entry key="name">grapefruit</entry>
           <entry key="color">Yellow</entry>
           <entry key="shape">sphere</entry>
           <entry key="texture">dimply</entry>
       </property-map>
       <property-map>
   </default-value>
 </fruit-data>
  • string - A Java String

The default value for the variable, region, is "Central".

 <region type="string">
   <default-value>Central</default-value>
 </region>
  • string-list - A list of Java String Objects

Like an array of strings.

 <to-address type="string-list">
   <default-value type="string-list">
       <list-item>joe.pentaho@pentaho.org</list-item>
       <list-item>admin@pentaho.org</list-item>
       <list-item>sales@pentaho.org</list-item>
       <list-item>noxidj@pentaho.org</list-item>
   </default-value>
 </to-address >

[edit] Status update on CS Dept. database:

I talked to Dr. Sun about possibly using the CSNS or CS Dept. database. However, there is no database for CSNS nor the CS Dept. He mentioned that there is something that he has been working on that is similar to my project idea. The project he is working on deals with taking surveys on CSNS. Charts will be generated based on the survey answers. He did not disclose what kind of surveys they were, but his project and my project idea seem to have the same goal - generating charts to view an overall status. He, however, is not using BI platforms and said that using BI platforms might be interesting.

He suggested that I talk to someone else to see if I can access some student database. Ideally, I would get realistic student data (with real student information removed/altered). However, if that was not possible, Dr. Sun suggested I generate sample data in a systematic way - school environment schema, business environment schema, etc.

[edit] Week 6 - February 8, 2008

[edit] Architecture

[edit] Pentaho Server

Runs on a J2EE compliant web server

  • Apache
  • JBOSS AS
  • WebSphere
  • WebLogic
  • Oracle AS

The Pentaho Server contains engines used for:

  • reporting
  • analysis
  • business rules
  • e-mail
  • desktop notifications
  • workflow

As mentioned last week, the solutions engine is used for workflows. The workflow, or Solution Definition, consists of XML documents. Those documents contain:

  • XPDL (business process definitions)
  • data sources (where the data is coming from)
  • queries (a set of instructions for retrieving data from a data source by providing certain filters and other criteria)
  • reporting templates (how the report looks like, display features)
  • delivery and notification rules (notify certain groups for warnings, etc.)

XSL and CSS are used to generate and create online reports. They can be altered and customized.

Solution Definitions can be copied from one server to another and more than one Solution Definition can be executed at a time.


Image:ArchitectureDiagram.jpg

Solution Engine The Solution Engine is the main component of the BI Platform. It controls the access to the user components (Reporting, Analysis, Dashboards, etc.) It is the component that connects the data sources to the clients, like System Monitors and Web Service Clients.

The engine is called by the Scheduler and the Workflow Engine.

[edit] XML Schema

Taken from Pentaho.com and altered

<action-sequence> - required 
  <name> 
  <version>
  <title>
  <logging-level>
  <documentation>
      <author>
      <description>
      <help>
      <result-type>
      <icon>
  <inputs>
      <param-name type="data-type">
      <default-value>
      <sources>
  <outputs>
      <param-name type="data-type">
  <resources>
      <resource-name>
      <resource-type> - required
      <location> - required
      <mime-type>
      <actions [loop-on="parameter-name"]> - required
  <actions [loop-on="parameter-name"]>
  <action-definition> - required
  <action-inputs>
      <input-name type="data type" mapping="param">
  <action-outputs>
      <output-name type="data-type">
      <component-name> - required
      <component-definition> - required

<action-sequence> The top level node for the Action Sequence Document.

<name> The name of the Action Sequence and it must match the file name of the document.

<version> The version number of the document.

<title> This is also the name of the document, but this one is used for display purposes only. References to this document will use the name under the <name> tag.

<logging-level> Defines the logging-level that can be used for logging output. The values are:

  • TRACE
  • DEBUG
  • INFO
  • WARN
  • ERROR (default)
  • FATAL

<documentation> Contains descriptive nodes used for generating documentation.

<author> The author of the Action Sequence

<description> Short description describing the Action Sequence.

<help> Contains instructions for an end user to use this Action Sequence.

<result-type> Type of output the action sequence will generate. Values include:

  • Report
  • Process
  • Business Rule - For example, if (customerNo < 3000 ) return ‘Bob’ else return ‘Mary’
  • View
  • None

<icon> Thumbnail image that the navigation component will use. This is for display purposes only.

<inputs> Contains the input parameters

<param-name type="data-type"> under <inputs> param-name is the parameter that the action sequence is expecting at run time.

<default-value> If a parameter is not provided at run time, it will use the value provided under this tag.

<sources> Parameter providers are listed in the order that should be queried in order to retrieve the parameter. Values include:

  • request
  • session
  • runtime

<outputs> List of output parameters.

<resources> List of resource parameters

<resource-name> Name of the resource that the action sequence is expecting to use.

<resource-type> The name of the type of resource required. Values include:

  • solution-file
  • file
  • url

<location> The path to the resource. If "solution-file", then the location is a pathname relative to the current location. If "file", then a fully qualified path name is required. If "url", then a fully qualified URL is required.

<mime-type> MIME type of the resource.

<actions [loop-on="parameter-name"]> The loop-on attribute is optional. The <actions> tag contains more <actions> tags. A group of actions will be executed multiple times. For example, if a loop-on attribute is named "department" and it's a string-list of department names, then the parameter named department will be set to a different department name for each iteration.

<action-definition> At least one is required. It defines a complete call to a component for execution of a task.

<action-inputs> List of action-input parameters.

<input-name type="data" maping ="param"> input-name is the name of the parameter that the action definition is expecting to be available at run time. type attribute indicates the type of parameter. mapping allows this input to be mapped to an action sequence input or a previous action-definition output with a different name.

<action-outputs> List of action-output parameters

<output-name type="data-type" > When the component is done executing, it will have this output-name as the parameter.

<component-name> The name of the Java class that executes this action definition

<component-definition> The component that is specific to the XML definition.

[edit] XML Process Definition Language (XPDL)

A process definition is a list of what happens between the starting point and ending point. In this case, it is the a series of instructions between identifying the problem and creating a solution. In order to make process definitions reusable, a standardized format must be created. XPDL is standardized by the Workflow Management Coalition (WfMC). The WfMC defines standards for re-usability for workflow management systems.

[edit] XPDL Packages

An XPDL Package may contain the following:

  • references to subflows, which are separately defined
  • initial process definition must have the minimal set of attributes and objects in order to execute the process execution

Image:Basic_XPDL_flow.jpg

Taken from wfmc.org Definitions taken from wfmc.org
Workflow
The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.

Workflow Management System A system that defines, creates and manages the execution of workflows through the use of software, running on one or more workflow engines, which is able to interpret the process definition, interact with workflow participants and, where required, invoke the use of IT tools and applications.

A Workflow Management System consists of software components to store and interpret process definitions, create and manage workflow instances as they are executed, and control their interaction with workflow participants and applications.

Business Process A set of one or more linked procedures or activities which collectively realise a business objective or policy goal, normally within the context of an organisational structure defining functional roles and relationships.

For example, Finance. Increasing revenues, better budgeting.

Process Definition The representation of a business process in a form which supports automated manipulation, such as modelling, or enactment by a workflow management system. The process definition consists of a network of activities and their relationships, criteria to indicate the start and termination of the process, and information about the individual activities, such as participants, associated IT applications and data, etc.

The process definition results from work during the process definition mode

Activity A description of a piece of work that forms one logical step within a process. An activity may be a manual activity, which does not support computer automation, or a workflow (automated) activity. A workflow activity requires human and/or machine resources(s) to support process execution; where human resource is required an activity is allocated to a workflow participant.

An activity is typically the smallest unit of work which is scheduled by a workflow engine during process enactment

Automated Activity An activity which is capable of computer automation using a workflow management system to manage the activity during execution of the business process of which it forms a part.

During process execution, an automated (or workflow) activity is managed by the Workflow Management System (WMS).

Manual Activity An activity within a business process which is not capable of automation and hence lies outside the scope of a workflow management system. Such activities may be included within a process definition, for example to support modelling of the process, but do not form part of a resulting workflow.

Instances A process or activity instance is created and managed by a workflow management system for each separate invocation of the process or activity.

Process and activity instances are in accordance with the process definition.

Work Item The representation of the work to be processed in the context of an activity within a process instance.

An activity typically generates one or more work items which together constitute the task to be undertaken by the user within this activity.

Invoked Application An invoked application is a workflow application that is invoked by the workflow management system to automate an activity, fully or in part, or to support a workflow participant in processing a workitem.

[edit] Week 7 - February 15, 2008

[edit] Pentaho Data Integration (formerly known as "Kettle")

Kettle is an open source ETL tool that has been acquired by Pentaho.

ETL stands for Extract, Transform, and Load. ETL is the process that actually gets the data into a data warehouse. Data gets extracted from various outside data sources. It then gets transformed to fit business needs. For example, unnecessary fields can be removed from tables, originally integer fields can be changed to char fields, etc. Finally, the transformed data gets loaded into the data warehouse.

Since Kettle is an ETL tool, it can extract, transform and load data into Pentaho. Kettle consists of four components:

  • Spoon - is a graphically oriented end-user tool to model the flow of data from input through transformation to output. One such model is also called a transformation
  • Pan - is a command line tool that executes transformations modeled with Spoon
  • Chef - is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control
  • Kitchen - is a command line tool used to execute jobs created with Chef

Definitions taken from ttp://rpbouman.blogspot.com/

[edit] Spoon

The Spoon tool allows you to create transformations with a modeling technique. Spoon allows you to create transformations by modeling how you want your data to flow.

Image:Spoon.jpg



[Demo for Spoon]



Installing MySQL on Vista poses a problem. There seem to be some kind of security issue/firewall issues with Vista and MySQL. I will try to see if I can get it fixed and running by next week.

[edit] Week 8 - February 22, 2008

[edit] More on Pentaho Data Integration -Spoon

Spoon is an ETL tool provided by the Pentaho Data Integration package. It does exactly what ETL is - extracts, transforms and loads data into the data warehouse.

It takes a modeling approach, which does not require

[edit] Creating Basic Transformations

[edit] Core objects

These are the types of functions that are used for the extraction, loading and transformation of data.

[edit] Input

These are the types of files that data is coming from.
Image:core_objects.jpg

I have created sample data CSV files that I will use to manipulate.

Major.csv

Student_Num	Major			Minor		Tassel_Color
100000001	Computer Science	Math		Yellow
100000002	Nursing			None		Orange
100000003	Nursing			None		Orange
100000004	Biology			None		Green
100000005	Physics			None		Blue
100000006	Computer Science	Math		Yellow
100000007	Engineering		Math		Yellow
100000008	Engineering		None		Yellow
100000009	Engineering		None		Yellow
100000010	Biology			None		Green
100000011	Physics			None		Blue
100000012	Computer Science	None		Yellow
100000013	Computer Science	English		Yellow
100000014	Math			None		Red
100000015	Math			None		Red
100000016	Psychology		None		Purple
100000017	Psychology		English		Purple
100000018	Math			History		Red
100000019	Computer Science	English		Yellow
100000020	Computer Science	None		Yellow


Student.csv

CID		First_Name	Last_Name	Expected_Grad_Date
100000001	Amanda 		Smith		2009
100000002	Michael		Paine		2008
100000003	Louise		Allen		2009
100000004	George		Johnson		2009
100000005	Michelle	Williams	2009
100000006	Victor		Brown		2010
100000007	Jill		David		2008
100000008	Jack		Miller		2009
100000009	Dorris		Moore		2009
100000010	Nemo		Taylor		2008
100000011	Patrick		Anderson	2010
100000012	Frank		Jackson		2010
100000013	Priscilla	White		2009
100000014	Richard		Harris		2008
100000015	Amber		Thompson	2011
100000016	Tim		Clark		2010
100000017	Bob		Hall		2010
100000018	William		Young		2011
100000019	Ryan		Jones		2011
100000020	Larry		Baker		2011


[edit] Output

The types of output can be physical files, to the database, etc.

Image:output.jpg

[edit] Adding CSV File

To do that, all you have to do is drag and drop from the Input list the "CSV file input."

Double click on the icon in the workspace to change/add information.
Image:addCsvFile.jpg

After adjusting the attributes, you can click on the 'Preview' button to preview the data that is in the file.
Image:CsvPreview.jpg

[edit] Adding Stream Lookup

Stream lookup is looking up and comparing data values from different tables/files.

Connecting a 'hop' from the Major.csv and Students.csv files to the Stream Lookup means that the Stream Lookup step will be looking and comparing those two files.
Image:StreamLookup.jpg

Lookup step: The table you want to pull data from.
The key(s) to look up the value(s): attributes to compare.
Specify the fields to retrieve: The fields that you want to retrieve from the file that you specified in the Lookup step.

Image:StreamLookupWindow.jpg

My goal here is to link up the student names with their corresponding major and minor.
Since I want to list everything in the Student.csv file, I will look up the majors and minors. So, change the 'Lookup step' field accordingly.

In the Students.csv file, the student ID is called "CID." In the Major.csv file, the student ID is called "Student_Num." So I change the 'The key(s) to look up the value(s)' accordingly.

The only information that I want from the Major.csv file is the 'Major' and 'Minor' columns.

Image:StreamLookupWindowAltered.jpg


[edit] Adding Dummy

The dummy icon is placed there to allow me to see what happens when the transformation is ran.

Image:dummy.jpg

[edit] Preview Result

Click on the Preview button to see the results!

Image:result.jpg




[Demo for Spoon]

[edit] Week 9 - February 29, 2008

[edit] Pentaho Data Integration - with MySQL

[edit] Loading Data into MySQL

[edit] Using MySQL

Creating a table called Students:

CREATE TABLE Students (
 CID INT(9),
 First_Name VARCHAR(255),
 Last_Name VARCHAR(255),
 Expected_Grad_Date INT (4)
);

Populating Students with values:

INSERT INTO Students (CID,
 First_Name,
 Last_Name,
 Expected_Grad_Date)
VALUES ('100000001', 'Amanda', 'Smith', '2009'),
       ('100000002', 'Michael', 'Paine', '2008');


[edit] Using Spoon


Select a CSV/Excel file containing your data.
Image:selectCSV.jpg

Fill in the Target Table field. This will be the name of the table containing your data.
Image:loadStudents.jpg

Click New after the Connection text field and specify your connection.
Here, I am using MySQL and my database name is University.
Image:new_connection.jpg

Here is my database in MySQL:
Image:showdatabase.jpg

Click Test to make sure Spoon is able to connect to your database. Below shows the confirmation.
Image:click_test.jpg

Going back to the Table Output menu, click on SQL. Then Execute. This is the code that will be input into MySQL.
So here, it creates a table with the specified fields.
Image:execute.jpg

Going into MySQL, I check and see that the table Students is indeed created.
Image:showtables.jpg

After running the transformation, the data from the CSV file is successfully loaded into my database.
Image:data_loaded.jpg

[edit] Splitting Tables in MySQL


I have loaded the Major.csv file into my database and named the table, major. Now, I want to split the data from the major table into two separate tables (one for computer science majors only, the other for the rest of the majors) and load those two tables into my database.

This is how the transformation will look like:
Image:filter.jpg

How to filter the table:
Image:filterCompSci.jpg

The result in my database:
Image:twoTables.jpg

[edit] Result Transformation


The following transformation will combine all the steps mentioned above and perform those steps all at once:
Image:total.jpg

[edit] Week 10 - March 7, 2008

[edit] Final

[edit] Brief project description

BI basically takes existing data, organizes it so that it can be presented to help businesses and organizations make informed decisions. The BI System integrates databases from different business processes into one centralized system. From there, users can create customized reports to view revenue earned, business predictions and trends, global performance, etc.

Why is this different from a simple SQL query?

  • BI allows more flexible reporting capabilities
    • Allow more user-friendly ways of displaying data (graphs, charts, etc.)
    • Allows non-technical users to create their own reports and navigate/manipulate existing reports.
  • SQL operates on one database. BI can integrate multiple databases and data sources.

Pentaho offers a way to create queries and pull data from the database for those who don't know coding. At the same time, it lets experienced users dive down into the code and change it as it see fits.

Users can create ad hoc reports with back-end being Pentaho, an open source Business Intelligence suite. The databases that the reports will come from will be school databases - Student name, CIN, major, etc. Due to time, I might not have time to implement ad hoc reporting capabilities. Users will only be able to manipulate canned reports.

For example, an administrator needs to decide which class to cancel. He/she can make a decision based on which classes that were offered in the past had the least enrollment/least popular. Note: As of now, I am unable to retrieve actual data. The data that I am operating on now are sample/self-made data.

They can create reports like:

'Which class has the lowest enrollment in the CS Department?' 
List enrollment number for the following parameters:
Department: CS
Courses: All
Quarter From: Spring 2006
Quarter To: Winter 2008

A graph/chart would be generated.

There might be a separate database for courses, one for departments, one for teachers, etc. So Pentaho would have the ability to go into all these different databases and pull out information that the administrator needs.


Pictures from Pentaho samples
Image:login.jpg


Image:selectParam.jpg


Image:webreport.jpg


Goal:
Demonstrate my learning of Business Intelligence through a full-scale BI implementation using Pentaho.

[edit] Anticipated users

Two types of users:

1. Faculty members, school administrators, and whoever needs student/school status information.
2. People who want to learn more about Pentaho's Business Intelligence Suite and why it is better than performing simple SQL queries.

[edit] Main conceptual (i.e., user-level) objects

For users who want to explore Pentaho:
They will have to use the tools that Pentaho provides, for example, Data Integration and Report Designer.

For users who only want to see reports:
Pull down menus are mainly what the users will need in order to customize a report to generate.

I will provide the fields and parameters for the user based on the database.


Image:param.jpg

[edit] Primary conceptual (i.e., user-level) operations

For canned queries:

The primary operation the user will do is to simply specify the parameters for their customized report. I will provide the parameter values based on the database.

[edit] Why I am interested in this project

Businesses are always looking at ways to improve their businesses and maximize profits. To make smarter business decisions, companies rely on historical data and previous records. Pentaho deals with organizing databases and allowing users to pull information out of multiple databases to view. The whole Business Intelligence concept really interests me so I decided to take this opportunity to learn more about it.

Moreover, by implementing a BI system, I can show that I am able to learn something new on my own.

[edit] Status

I have learned how to use the Pentaho Data Integration tool well enough to meet the requirements needed for my project. I am able to use Spoon to manipulate databases.

I have also played around with the Pentaho Report Designer tool.