Courses/CS 491ab/Winter 2008/Marcella Wong
From CSWiki
Contents |
[edit] Week 1 - January 4, 2008
Introduction
[edit] Week 2 - January 11, 2008
Business Intelligence
BI sits on top of an existing database application. BI basically takes existing data, organizes it so that it can be presented to help business owners make informed decisions. The BI System integrates databases from different business processes into one centralized system. From there, users can create customized reports to view revenue earned, business predictions and trends, global performance, etc.
BI System is used for data mining, data manipulation and data presentation. BI essentially takes existing data from, say a data warehouse, and aggregates, manipulates and organize the data so that it can allow for reports and analytics.
For example, if you want to figure out how many of each item to stock, you'd want to look for customer purchasing habits and patterns. So when you scan grocery items, all customer purchase info will go into a database. The BI application would take that data from your database, manipulate it, aggregate it and present it in a way such that the manager will be able to make an informed decision such as charts and diagrams.
So, what Pentaho is just the BI System that needs to "plug-in" to an existing business application. An existing business application helps you run business processes like ordering spare parts, setting payroll, hiring new employees, giving employee benefits, handling vendor information. There is a database underneath it and you add/remove information in the database.
Platform: Pentaho
In order to make sense of the data in BI, you would need additional tools to help business owners see and analyze the data. Pentaho offers 6 business applications:
- Reporting
- Display charts, tables, etc. to look at the business in different areas.
- Ex. Status reports displaying revenue earned each year for the last 5 years.
- Report types include Microsoft Excel, HTML, pdf, etc.
- Analysis
- Break down status reports to get more details.
- Ex. Sales dropped overall in the last year. Was the overall sales revenue drop pulled down by a specific region? If yes, look into that region to see what is going on.
- Allow for analysis on different business processes at once?
- Dashboards
- Shows key metrics to improve organizational performance.
- Ex. Are we meeting targets for profitability?
- Shows different desired reports together to compare.
- Dashboards
- Data Mining Engine
- A set of machine learning algorithms that data can be ran through for analysis to find correlations to improve business.
- Ex. Based on previous data, this customer will pay late.
- Data Integration
- Integrates different sets of data to one consistent version across all the sources.
- Uses Extractions, Transformations and Loading capabilities
- Ex. Importing data in database from text files and excel sheets.
- BI Platform
- Business Intelligence Platform that supports the above capabilities.
- Supports the 5 tools listed above.
[edit] Week 3 - January 18, 2008
[edit] Project (Tentative):
Be able to let users create all types of reports that give statistics on Computer Science students.
Types of reports:
- Ad hoc (allow users to create specific, customized queries)
- Which elective courses had the most students for the past 3 years?
- Number of students enrolled in certain classes
- Ex. Did certain classes get harder over the years?
- Default Reports
- Professor - Student ratio
- Which courses are getting maxed out
- Which courses are not getting enough sign-up rates?
- Popular/Unpopular elective courses
- Statistics on Grades
- Ex. Average grades throughout all CS classes
- Students who are receiving grants/scholarships
Main Users:
- Administration
- Faculty Members
- Professors
[edit] Getting Started with Pentaho
[edit] Start Pentaho BI Server
[edit] Logging In
[edit] Main Page
[edit] Go -> Solutions
- Collection of folders and documents in Pentaho. Contains all your Business Solutions - types of reports, etc.
[edit] Reporting Example
- Typical reporting/statistics pie chart
[edit] Analysis Example
- Starts off with main headers, which can be collapsed to view more details
[edit] Dashboard Example
- Dashboard - putting multiple reports/graphs together to compare
- Interactive
- Embed onto webpages
- Samples > Dashboards > JSP Dashboard
[edit] Go -> My Workspace
- Allow users to run reports in the background
- Allow users to select certain reports and schedule delivery dates (via e-mail), delivery format, etc.
Subscriptions screenshot courtesy of Pentaho.com
Note: Talked to Dr. Pamula about GET system's database architecture. Result: GET system's database is really messy. It's better to create my own database, but talk to him next week.
[edit] Week 4 - January 25, 2008
Talked to Dr. Pamula about CSNS. Dr. Sun is actually in charge of the CSNS database. Since he is away for this quarter, I will have to e-mail him about it. Hopefully he'll be able to help.
[edit] Building a Pentaho Report
After downloading and starting Pentaho Report Designer, you should see this:
[edit] Palette Panel
- Shows the different types of objects that are used to create the report.
- Label: Anything. Properties Panel will allow you to adjust font, size, etc.
- Text Field: Must match database attribute names. It pulls up the values of that attribute from database.
- Number Field: Same as Text Field, but for numbers. Allow you to format numbers to certain styles.
- Rectangle, Ellipse, Line: Add rectangles, ellipses and lines to report. Properties panel will handle formatting.
[edit] Structure
- The text version of the layout of your report page
[edit] Adding to Data Set
Data Set: A set of tables in your database.
- Connection Information
- Specifies the connection between Report Designer and your database
- Query Details
- Allows you to type in your own SQL code to run queries.
- SQLeonardo: Similar to Microsoft Access. Allows you to create queries without using SQL.
- Allows you to view the attributes and values in tables
- When you want to know the attribute type (String, integer, etc.), click on it in the Structure panel and it will show up in the Properties panel.
[edit] Finished Building Report
- Drag and drop objects onto the panels to create your report. Finished report in Designer can look something like this:
- The Structure panel will have a list of all the Objects being used to create the report.
- Properties panel will display the appropriate properties for different objects when they are highlighted in the Structure panel.
- An example of a completed report can look like this:
[edit] Report Types
- There are many different ways to display your report.
- HTML
- CSV
- etc.
[edit] Week 5 - February 1, 2008
[edit] Getting Started: Creating Solutions
[edit] What are Solutions?
To increase business profits, companies look at charts and reports based on historical data to make smart business decisions. From those charts/reports, decisions need to be made and a process towards that decision needs to be started. This entire process - identifying the problem/situation, making decisions, and starting the process to fixing the problem - is the Solution.
[edit] Workflows
A chain of events that execute a Solution.
Taken from Pentaho.com:
For example: run a query to find out which departments are over budget, run a budget report for each of those departments, and finally email each report to the department manager.
Example for my project:
Run a report of all the clubs on campus and their budgets. Run a list of reports alongside it to see which clubs are requesting money. Based on that, allocate appropriate amounts for each club and email the decision to the Finance Department.
Pentaho provides an engine that allows for easy creation of workflows. The workflow engine uses XML Process Definition Language (XPDL) to execute workflows. XPDL is a standardized XML format for modeling tools and workflow engines. It helps with the interchanging of business process definitions.
[edit] Case Study
Taken from Pentaho.com:
Problem: An employee for a health care organization is working with an expired license.
Possible Outcomes:
- Employer must find a replacement until employee's license is renewed.
- Employee continues working, unnoticed by employer that he is working with expired license. Here, patient safety hazard and a liability risk occurs.
Business Goal: Reduce money spent to cover unlicensed employees until they renew their licenses. Increase patient safety. Reduce liability risks.
Current Process: Each manager maintains a list of employees with their license expiration.
Proposed Solution: Have a centralized database with all the employees, their departments, current licenses and expiration on those licenses. Every so often, send out reports to each department manager notifying their nearly-expired licenses.
Solution 1: Distribute a list of employees with their expired license dates once a month to department managers.
- Managers must take time to read and look through the report.
- Dates can get misread
- Report doesn't get delivered.
- Employees do not allow enough time for license renewal.
This solution is insufficient. The report gets delivered, but chances of reaching the business goal is not high.
Solution 2: Create workflows/rules that determine the amount of time the employees has to get his/her license renewal on time.
- Send out reports to employee and managers each week with warnings that indicate the nearing license expiration.
- Require acknowledgment from both manager and employees
- Deliver failure-to-respond type notifications
Solution 2 offers a more sufficient and efficient solution. It not only delivers the report, but also makes sure that the managers and employees read the notices and acknowledges what needs to be done.
[edit] Action Sequence
All examples listed below are provided by Pentaho.comAn Action Sequence is an XML document that lists tasks that the solution engine can perform. The Solution Engine that Pentaho has is where the Action Sequences are read and executed.
[edit] Defining Input Variables
<inputs>
<region type="string">
<default-value>Central</default-value>
<sources>
<request>REGION</request>
<session>aRegion</session>
</sources>
</region>
<from type="string">
<default-value>joepentaho@pentaho.org</default-value>
</from>
<subject type="string">
<default-value>Pentaho Example1</default-value>
</subject>
<message-plain type="string">
<default-value>
This is an email from the Pentaho BI Platform - Example1
</default-value>
</message-plain>
</inputs>
There are 4 inputs in this Action Sequence. They are enclosed in the <input> tags:
- region
- Has type string
- Has default value of Central
- from
- Has type string
- Has default value of joepentaho@pentaho.org
- subject
- Has type string
- Has default value of Pentaho Example1
- message-plain
- Has type string
- Has default value of This is an email from the Pentaho BI Platform - Example1
[edit] Defining Actions - With JavaScript
Action Definitions define what actions that the Solution Engine must perform. In XML, events to be executed are enclosed in <action-definition> tags, which are enclosed in <actions> tag.
<action-definition>
<action-inputs>
<region type="string"/>
</action-inputs>
<action-outputs>
<rule_result type="string"/>
</action-outputs>
<component-name>JavascriptRule</component-name>
<component-definition>
<script>
<![CDATA[
if ( "Central".equals( region ) ) {
rule_result = "joe@pentaho.org";
}
else {
rule_result = "suzy@pentaho.org";
}
]]>
</script>
</component-definition>
</action-definition>
There are two variables that are being used in this action definition - region and rule_result. The region variable comes from the input that is given by the user earlier. The rule_result variable is newly created. It doesn't have a default value and does not accept any inputs. A value will be assigned to it.
Notice how there is a JavaScript Component. The lines written between the <component> tags are JavaScript. If the region variable has a value of "Central", then it will assign joe@pentaho.org to rule_result. If not, it will assign suzy@pentaho.org to rule_result.
[edit] Variable Creation and Assignment
<action-definition>
<action-inputs>
<to type="string" mapping="rule_result"/>
<from type="string"/>
<subject type="string"/>
<message-plain type="string"/>
</action-inputs>
<component-name>EmailComponent</component-name>
<component-definition/>
</action-definition>
In this segment, the from, subject, and message-plain have values that are assigned by the user input. The to variable is newly created. The component specified here is "EmailComponent", which means this entire action sequence is an email.
<to type="string" mapping="rule_result"/>
The line above is analogous to: to = rule_result. You are assigning the value received from the if-statement (rule_result) and putting it into the variable to.
[edit] Data Types
- content - Content is large amount of data that is generated within a component. An example of content is a PDF file that is generated from the reporting component.
<attachment type="content">
<sources>
<runtime>report-output</runtime>
</sources>
</attachment>
- long - A Java Long Object.
<amount type="long">
<default-value>25</default-value>
</amount>
- property-map - A property map of Java Strings
The property-map name is "veggie-data". Here, name = carrot, color = orange, etc. In database, the key usually maps to column names and the values usually map to that column's data.
<veggie-data type="property-map ">
<default-value type="property-map">
<property-map>
<entry key="name">carrot</entry>
<entry key="color">orange</entry>
<entry key="shape">cone</entry>
<entry key="texture">bumpy</entry>
</property-map>
</default-value>
</veggie-data>
- property-map-list - A list of property maps of Java Strings.
There are several two property-maps in the example below.
<fruit-data type="property-map-list">
<default-value type="property-map-list">
<property-map>
<entry key="name">orange</entry>
<entry key="color">orange</entry>
<entry key="shape">sphere</entry>
<entry key="texture">dimply</entry>
</property-map>
<property-map>
<entry key="name">grapefruit</entry>
<entry key="color">Yellow</entry>
<entry key="shape">sphere</entry>
<entry key="texture">dimply</entry>
</property-map>
<property-map>
</default-value>
</fruit-data>
- string - A Java String
The default value for the variable, region, is "Central".
<region type="string"> <default-value>Central</default-value> </region>
- string-list - A list of Java String Objects
Like an array of strings.
<to-address type="string-list">
<default-value type="string-list">
<list-item>joe.pentaho@pentaho.org</list-item>
<list-item>admin@pentaho.org</list-item>
<list-item>sales@pentaho.org</list-item>
<list-item>noxidj@pentaho.org</list-item>
</default-value>
</to-address >
[edit] Status update on CS Dept. database:
I talked to Dr. Sun about possibly using the CSNS or CS Dept. database. However, there is no database for CSNS nor the CS Dept. He mentioned that there is something that he has been working on that is similar to my project idea. The project he is working on deals with taking surveys on CSNS. Charts will be generated based on the survey answers. He did not disclose what kind of surveys they were, but his project and my project idea seem to have the same goal - generating charts to view an overall status. He, however, is not using BI platforms and said that using BI platforms might be interesting.
He suggested that I talk to someone else to see if I can access some student database. Ideally, I would get realistic student data (with real student information removed/altered). However, if that was not possible, Dr. Sun suggested I generate sample data in a systematic way - school environment schema, business environment schema, etc.
[edit] Week 6 - February 8, 2008
[edit] Architecture
[edit] Pentaho Server
Runs on a J2EE compliant web server
- Apache
- JBOSS AS
- WebSphere
- WebLogic
- Oracle AS
The Pentaho Server contains engines used for:
- reporting
- analysis
- business rules
- desktop notifications
- workflow
As mentioned last week, the solutions engine is used for workflows. The workflow, or Solution Definition, consists of XML documents. Those documents contain:
- XPDL (business process definitions)
- data sources (where the data is coming from)
- queries (a set of instructions for retrieving data from a data source by providing certain filters and other criteria)
- reporting templates (how the report looks like, display features)
- delivery and notification rules (notify certain groups for warnings, etc.)
XSL and CSS are used to generate and create online reports. They can be altered and customized.
Solution Definitions can be copied from one server to another and more than one Solution Definition can be executed at a time.
Solution Engine The Solution Engine is the main component of the BI Platform. It controls the access to the user components (Reporting, Analysis, Dashboards, etc.) It is the component that connects the data sources to the clients, like System Monitors and Web Service Clients.
The engine is called by the Scheduler and the Workflow Engine.
[edit] XML Schema
Taken from Pentaho.com and altered
<action-sequence> - required
<name>
<version>
<title>
<logging-level>
<documentation>
<author>
<description>
<help>
<result-type>
<icon>
<inputs>
<param-name type="data-type">
<default-value>
<sources>
<outputs>
<param-name type="data-type">
<resources>
<resource-name>
<resource-type> - required
<location> - required
<mime-type>
<actions [loop-on="parameter-name"]> - required
<actions [loop-on="parameter-name"]>
<action-definition> - required
<action-inputs>
<input-name type="data type" mapping="param">
<action-outputs>
<output-name type="data-type">
<component-name> - required
<component-definition> - required
<action-sequence> The top level node for the Action Sequence Document.
<name> The name of the Action Sequence and it must match the file name of the document.
<version> The version number of the document.
<title> This is also the name of the document, but this one is used for display purposes only. References to this document will use the name under the <name> tag.
<logging-level> Defines the logging-level that can be used for logging output. The values are:
- TRACE
- DEBUG
- INFO
- WARN
- ERROR (default)
- FATAL
<documentation> Contains descriptive nodes used for generating documentation.
<author> The author of the Action Sequence
<description> Short description describing the Action Sequence.
<help> Contains instructions for an end user to use this Action Sequence.
<result-type> Type of output the action sequence will generate. Values include:
- Report
- Process
- Business Rule - For example, if (customerNo < 3000 ) return ‘Bob’ else return ‘Mary’
- View
- None
<icon> Thumbnail image that the navigation component will use. This is for display purposes only.
<inputs> Contains the input parameters
<param-name type="data-type"> under <inputs> param-name is the parameter that the action sequence is expecting at run time.
<default-value> If a parameter is not provided at run time, it will use the value provided under this tag.
<sources> Parameter providers are listed in the order that should be queried in order to retrieve the parameter. Values include:
- request
- session
- runtime
<outputs> List of output parameters.
<resources> List of resource parameters
<resource-name> Name of the resource that the action sequence is expecting to use.
<resource-type> The name of the type of resource required. Values include:
- solution-file
- file
- url
<location> The path to the resource. If "solution-file", then the location is a pathname relative to the current location. If "file", then a fully qualified path name is required. If "url", then a fully qualified URL is required.
<mime-type> MIME type of the resource.
<actions [loop-on="parameter-name"]> The loop-on attribute is optional. The <actions> tag contains more <actions> tags. A group of actions will be executed multiple times. For example, if a loop-on attribute is named "department" and it's a string-list of department names, then the parameter named department will be set to a different department name for each iteration.
<action-definition> At least one is required. It defines a complete call to a component for execution of a task.
<action-inputs> List of action-input parameters.
<input-name type="data" maping ="param"> input-name is the name of the parameter that the action definition is expecting to be available at run time. type attribute indicates the type of parameter. mapping allows this input to be mapped to an action sequence input or a previous action-definition output with a different name.
<action-outputs> List of action-output parameters
<output-name type="data-type" > When the component is done executing, it will have this output-name as the parameter.
<component-name> The name of the Java class that executes this action definition
<component-definition> The component that is specific to the XML definition.
[edit] XML Process Definition Language (XPDL)
A process definition is a list of what happens between the starting point and ending point. In this case, it is the a series of instructions between identifying the problem and creating a solution. In order to make process definitions reusable, a standardized format must be created. XPDL is standardized by the Workflow Management Coalition (WfMC). The WfMC defines standards for re-usability for workflow management systems.
[edit] XPDL Packages
An XPDL Package may contain the following:
- references to subflows, which are separately defined
- initial process definition must have the minimal set of attributes and objects in order to execute the process execution
Taken from wfmc.org
Definitions taken from wfmc.org
Workflow
The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.
Workflow Management System A system that defines, creates and manages the execution of workflows through the use of software, running on one or more workflow engines, which is able to interpret the process definition, interact with workflow participants and, where required, invoke the use of IT tools and applications.
A Workflow Management System consists of software components to store and interpret process definitions, create and manage workflow instances as they are executed, and control their interaction with workflow participants and applications.
Business Process A set of one or more linked procedures or activities which collectively realise a business objective or policy goal, normally within the context of an organisational structure defining functional roles and relationships.
For example, Finance. Increasing revenues, better budgeting.
Process Definition The representation of a business process in a form which supports automated manipulation, such as modelling, or enactment by a workflow management system. The process definition consists of a network of activities and their relationships, criteria to indicate the start and termination of the process, and information about the individual activities, such as participants, associated IT applications and data, etc.
The process definition results from work during the process definition mode
Activity A description of a piece of work that forms one logical step within a process. An activity may be a manual activity, which does not support computer automation, or a workflow (automated) activity. A workflow activity requires human and/or machine resources(s) to support process execution; where human resource is required an activity is allocated to a workflow participant.
An activity is typically the smallest unit of work which is scheduled by a workflow engine during process enactment
Automated Activity An activity which is capable of computer automation using a workflow management system to manage the activity during execution of the business process of which it forms a part.
During process execution, an automated (or workflow) activity is managed by the Workflow Management System (WMS).
Manual Activity An activity within a business process which is not capable of automation and hence lies outside the scope of a workflow management system. Such activities may be included within a process definition, for example to support modelling of the process, but do not form part of a resulting workflow.
Instances A process or activity instance is created and managed by a workflow management system for each separate invocation of the process or activity.
Process and activity instances are in accordance with the process definition.
Work Item The representation of the work to be processed in the context of an activity within a process instance.
An activity typically generates one or more work items which together constitute the task to be undertaken by the user within this activity.
Invoked Application An invoked application is a workflow application that is invoked by the workflow management system to automate an activity, fully or in part, or to support a workflow participant in processing a workitem.
[edit] Week 7 - February 15, 2008
[edit] Pentaho Data Integration (formerly known as "Kettle")
Kettle is an open source ETL tool that has been acquired by Pentaho.
ETL stands for Extract, Transform, and Load. ETL is the process that actually gets the data into a data warehouse. Data gets extracted from various outside data sources. It then gets transformed to fit business needs. For example, unnecessary fields can be removed from tables, originally integer fields can be changed to char fields, etc. Finally, the transformed data gets loaded into the data warehouse.
Since Kettle is an ETL tool, it can extract, transform and load data into Pentaho. Kettle consists of four components:
- Spoon - is a graphically oriented end-user tool to model the flow of data from input through transformation to output. One such model is also called a transformation
- Pan - is a command line tool that executes transformations modeled with Spoon
- Chef - is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control
- Kitchen - is a command line tool used to execute jobs created with Chef
Definitions taken from ttp://rpbouman.blogspot.com/
[edit] Spoon
The Spoon tool allows you to create transformations with a modeling technique. Spoon allows you to create transformations by modeling how you want your data to flow.
[Demo for Spoon]
Installing MySQL on Vista poses a problem. There seem to be some kind of security issue/firewall issues with Vista and MySQL. I will try to see if I can get it fixed and running by next week.
[edit] Week 8 - February 22, 2008
[edit] More on Pentaho Data Integration -Spoon
Spoon is an ETL tool provided by the Pentaho Data Integration package. It does exactly what ETL is - extracts, transforms and loads data into the data warehouse.
It takes a modeling approach, which does not require
[edit] Creating Basic Transformations
[edit] Core objects
These are the types of functions that are used for the extraction, loading and transformation of data.
[edit] Input
These are the types of files that data is coming from.
I have created sample data CSV files that I will use to manipulate.
Major.csv
Student_Num Major Minor Tassel_Color 100000001 Computer Science Math Yellow 100000002 Nursing None Orange 100000003 Nursing None Orange 100000004 Biology None Green 100000005 Physics None Blue 100000006 Computer Science Math Yellow 100000007 Engineering Math Yellow 100000008 Engineering None Yellow 100000009 Engineering None Yellow 100000010 Biology None Green 100000011 Physics None Blue 100000012 Computer Science None Yellow 100000013 Computer Science English Yellow 100000014 Math None Red 100000015 Math None Red 100000016 Psychology None Purple 100000017 Psychology English Purple 100000018 Math History Red 100000019 Computer Science English Yellow 100000020 Computer Science None Yellow
Student.csv
CID First_Name Last_Name Expected_Grad_Date 100000001 Amanda Smith 2009 100000002 Michael Paine 2008 100000003 Louise Allen 2009 100000004 George Johnson 2009 100000005 Michelle Williams 2009 100000006 Victor Brown 2010 100000007 Jill David 2008 100000008 Jack Miller 2009 100000009 Dorris Moore 2009 100000010 Nemo Taylor 2008 100000011 Patrick Anderson 2010 100000012 Frank Jackson 2010 100000013 Priscilla White 2009 100000014 Richard Harris 2008 100000015 Amber Thompson 2011 100000016 Tim Clark 2010 100000017 Bob Hall 2010 100000018 William Young 2011 100000019 Ryan Jones 2011 100000020 Larry Baker 2011
[edit] Output
The types of output can be physical files, to the database, etc.
[edit] Adding CSV File
To do that, all you have to do is drag and drop from the Input list the "CSV file input."
Double click on the icon in the workspace to change/add information.
After adjusting the attributes, you can click on the 'Preview' button to preview the data that is in the file.
[edit] Adding Stream Lookup
Stream lookup is looking up and comparing data values from different tables/files.
Connecting a 'hop' from the Major.csv and Students.csv files to the Stream Lookup means that the Stream Lookup step will be looking and comparing those two files.

Lookup step: The table you want to pull data from.
The key(s) to look up the value(s): attributes to compare.
Specify the fields to retrieve: The fields that you want to retrieve from the file that you specified in the Lookup step.
My goal here is to link up the student names with their corresponding major and minor.
Since I want to list everything in the Student.csv file, I will look up the majors and minors. So, change the 'Lookup step' field accordingly.
In the Students.csv file, the student ID is called "CID." In the Major.csv file, the student ID is called "Student_Num." So I change the 'The key(s) to look up the value(s)' accordingly.
The only information that I want from the Major.csv file is the 'Major' and 'Minor' columns.
[edit] Adding Dummy
The dummy icon is placed there to allow me to see what happens when the transformation is ran.
[edit] Preview Result
Click on the Preview button to see the results!
[Demo for Spoon]
[edit] Week 9 - February 29, 2008
[edit] Pentaho Data Integration - with MySQL
[edit] Loading Data into MySQL
[edit] Using MySQL
Creating a table called Students:
CREATE TABLE Students ( CID INT(9), First_Name VARCHAR(255), Last_Name VARCHAR(255), Expected_Grad_Date INT (4) );
Populating Students with values:
INSERT INTO Students (CID,
First_Name,
Last_Name,
Expected_Grad_Date)
VALUES ('100000001', 'Amanda', 'Smith', '2009'),
('100000002', 'Michael', 'Paine', '2008');
[edit] Using Spoon
Select a CSV/Excel file containing your data.
Fill in the Target Table field. This will be the name of the table containing your data.
Click New after the Connection text field and specify your connection.
Here, I am using MySQL and my database name is University.
Here is my database in MySQL:
Click Test to make sure Spoon is able to connect to your database. Below shows the confirmation.
Going back to the Table Output menu, click on SQL. Then Execute. This is the code that will be input into MySQL.
So here, it creates a table with the specified fields.
Going into MySQL, I check and see that the table Students is indeed created.
After running the transformation, the data from the CSV file is successfully loaded into my database.
[edit] Splitting Tables in MySQL
I have loaded the Major.csv file into my database and named the table, major. Now, I want to split the data from the major table into two separate tables (one for computer science majors only, the other for the rest of the majors) and load those two tables into my database.
This is how the transformation will look like:
[edit] Result Transformation
The following transformation will combine all the steps mentioned above and perform those steps all at once:
[edit] Week 10 - March 7, 2008
[edit] Final
[edit] Brief project description
BI basically takes existing data, organizes it so that it can be presented to help businesses and organizations make informed decisions. The BI System integrates databases from different business processes into one centralized system. From there, users can create customized reports to view revenue earned, business predictions and trends, global performance, etc.
Why is this different from a simple SQL query?
- BI allows more flexible reporting capabilities
- Allow more user-friendly ways of displaying data (graphs, charts, etc.)
- Allows non-technical users to create their own reports and navigate/manipulate existing reports.
- SQL operates on one database. BI can integrate multiple databases and data sources.
Pentaho offers a way to create queries and pull data from the database for those who don't know coding. At the same time, it lets experienced users dive down into the code and change it as it see fits.
Users can create ad hoc reports with back-end being Pentaho, an open source Business Intelligence suite. The databases that the reports will come from will be school databases - Student name, CIN, major, etc. Due to time, I might not have time to implement ad hoc reporting capabilities. Users will only be able to manipulate canned reports.
For example, an administrator needs to decide which class to cancel. He/she can make a decision based on which classes that were offered in the past had the least enrollment/least popular. Note: As of now, I am unable to retrieve actual data. The data that I am operating on now are sample/self-made data.
They can create reports like:
'Which class has the lowest enrollment in the CS Department?'
List enrollment number for the following parameters: Department: CS Courses: All Quarter From: Spring 2006 Quarter To: Winter 2008
A graph/chart would be generated.
There might be a separate database for courses, one for departments, one for teachers, etc. So Pentaho would have the ability to go into all these different databases and pull out information that the administrator needs.
Goal:
Demonstrate my learning of Business Intelligence through a full-scale BI implementation using Pentaho.
[edit] Anticipated users
Two types of users:
1. Faculty members, school administrators, and whoever needs student/school status information.
2. People who want to learn more about Pentaho's Business Intelligence Suite and why it is better than performing simple SQL queries.
[edit] Main conceptual (i.e., user-level) objects
For users who want to explore Pentaho:
They will have to use the tools that Pentaho provides, for example, Data Integration and Report Designer.
For users who only want to see reports:
Pull down menus are mainly what the users will need in order to customize a report to generate.
I will provide the fields and parameters for the user based on the database.
[edit] Primary conceptual (i.e., user-level) operations
For canned queries:
The primary operation the user will do is to simply specify the parameters for their customized report. I will provide the parameter values based on the database.
[edit] Why I am interested in this project
Businesses are always looking at ways to improve their businesses and maximize profits. To make smarter business decisions, companies rely on historical data and previous records. Pentaho deals with organizing databases and allowing users to pull information out of multiple databases to view. The whole Business Intelligence concept really interests me so I decided to take this opportunity to learn more about it.
Moreover, by implementing a BI system, I can show that I am able to learn something new on my own.
[edit] Status
I have learned how to use the Pentaho Data Integration tool well enough to meet the requirements needed for my project. I am able to use Spoon to manipulate databases.
I have also played around with the Pentaho Report Designer tool.




































