SAS with Hadoop

SAS’ recent announcement of an alliance with Hortonworks marks a good opportunity to summarize SAS’ Hadoop capabilities. Analytic enterprises are increasingly serious about using Hadoop as an analytics platform; organizations with significant “sunk” investment in SAS are naturally interested in understanding SAS’ ability to work with Hadoop.
 
Prior to January, 2012, a search for the words “Hadoop” or “MapReduce” returned no results on the SAS marketing and support websites, which says something about SAS’ leadership in this area. In March 2012, SAS announced support for Hadoop connectivity; since then, SAS has gradually expanded the features it supports with Hadoop.
 

As of today, there are four primary ways that a SAS user can leverage Hadoop:

  • Legacy SAS users can connect to Hadoop through the SAS/ACCESS Interface to Hadoop
  • SAS Enterprise Miner users can export scoring models to Hadoop with SAS Scoring Accelerator
  • SAS LASR Server, the back end for SAS Visual Analytics, can be co-located in Hadoop
  • The SAS High Performance Analytics suite can be co-located in Hadoop

Let’s take a look at each option.

“Legacy SAS” is a convenient term for Base SAS, SAS/STAT and various packages (GRAPH, ETS, OR, etc) that are used primarily from a programming interface. SAS/ACCESS Interface to Hadoop provides SAS users with the ability to connect to Hadoop, pass through Hive, Pig or MapReduce commands, extract data and bring it back to the SAS server for further processing. It works in a manner similar to all of the SAS/ACCESS engines, but there are some inherent differences between Hadoop and commercial databases that impact the SAS user. 
 
SAS/ACCESS also supports six “Hadoop-enabled” PROCs (FREQ, MEANS, RANK, REPORT, SUMMARY, TABULATE); for perspective, there are some 300 PROCs in Legacy SAS, so there are ~294 PROCs that do not run inside Hadoop. If all you need to do is run frequency distributions, simple statistics and summary reports, then SAS offers everything you need for analytics in Hadoop. If that is all you want to do, of course, you can use Datameer or Big Sheets and save on SAS licensing fees.

A SAS programmer who is an expert in Hive, Pig or MapReduce can accomplish a lot with this capability, but the SAS software provides minimal support and does not “translate” SAS DATA steps. (In my experience, most SAS users are not experts in SQL, Hive, Pig or MapReduce.) SAS users who work with the SAS Pass-Through SQL Facility know that in practice one must submit explicit SQL to the database, because “implicit SQL” only works in certain circumstances (which SAS does not document); if SAS cannot implicitly translate a DATA Step into SQL/HiveQL, it copies the data back to the SAS server, without warning, and performs the operation there.

SAS/ACCESS Interface to Hadoop works with HiveQL, but the user experience is similar to working with SQL Pass-Through. Limited as “implicit HiveQL” may be, SAS does not claim to offer “implicit Pig” or “implicit MapReduce”. The bottom line is that since the user needs to know how to program in Hive, Pig or MapReduce to use SAS/ACCESS Interface to Hadoop, the user might as well submit jobs directly to Hive, Pig or MapReduce and save on SAS licensing fees.

SAS has not yet released the SAS/ACCESS Interface to Cloudera Impala, which it announced in October for December 2013 availability.
 
SAS Scoring Accelerator enables a SAS Enterprise Miner user to export scoring models to relational databases, appliances and (most recently) to Cloudera. Scoring Accelerator only works with SAS Enterprise Miner, and it doesn’t work with “code nodes”, which means that in practice most customers must rebuild existing predictive models to take advantage of the product. Customers who already use SAS Enterprise Miner can export the models in PMML and use them in any PMML-enabled database or decision engine and spend less on SAS licensing fees.

Which brings us to the two relatively new in-memory products, SAS Visual Analytics/SAS LASR Server and SAS High Performance Analytics Server. These products were originally designed to run in specially constructed appliances from Teradata and Greenplum; with SAS 9.4 they are supported in a co-located Hadoop configuration that SAS calls a Distributed Alongside-HDFS architecture. That means LASR and HPA can be installed on Hadoop nodes next to HDFS and, in theory, distributed throughout the Hadoop cluster with one instance of SAS on each node.
 
That looks good on a PowerPoint, but feedback from customers who have attempted to deploy SAS HPA in Hadoop is negative. In a Q&A session at Strata NYC, SAS VP Paul Kent commented that it is possible to run SAS HPA on commodity hardware as long as you don’t want to run MapReduce jobs at the same time. SAS’ hardware partners recommend 16-core machines with 256-512GB RAM for each HPA/LASR node; that hardware costs five or six times as much as a standard Hadoop worker node machine. Since even the most committed SAS customer isn’t willing to replace the hardware in a 400-node Hadoop cluster, most customers will stand up a few high-end machines next to the Hadoop cluster and run the in-memory analytics in what SAS calls Asymmetric Distributed Alongside-HDFS mode. This architecture adds latency to runtime performance, since data must be copied from the HDFS Data Nodes to the Analytic Nodes.

While HPA can work directly with HDFS data, VA/LASR Server requires data to be in SAS’ proprietary SASHDAT format. To import the data into SASHDAT, you will need to license SAS Data Integration Server.
 
A single in-memory node running on a 16-core/256GB machine can load a 75-100GB table, so if you’re working with a terabyte-sized dataset you’re going to need 10-12 nodes. SAS does not publicly disclose its software pricing, but customers and partners report quotes with seven zeros for similar configurations. Two years into General Availability, SAS has no announced customers for SAS High Performance Analytics.
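
As a rough check on the node count above, the arithmetic can be sketched in shell. This assumes that 1 TB is taken as 1000 GB and that each node loads between 75 GB and 100 GB, as stated; the function name nodes_needed is only illustrative.

```shell
# Ceiling division: how many nodes are needed to hold dataset_gb
# when each node can load at most per_node_gb.
# Both inputs are assumptions taken from the figures quoted in the text.
nodes_needed() {
    dataset_gb=$1
    per_node_gb=$2
    echo $(( (dataset_gb + per_node_gb - 1) / per_node_gb ))
}

best_case=$(nodes_needed 1000 100)   # every node loads a full 100GB share
worst_case=$(nodes_needed 1000 75)   # every node loads only a 75GB share
echo "Nodes for 1TB: between $best_case and $worst_case"
```

Under these assumptions the range works out to roughly 10 to 14 nodes, in the same ballpark as the estimate quoted above.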

SAS seems to be doing a little better selling SAS VA/LASR Server; they have a big push on in 2013 to sell 2,000 copies of VA and heavily promote a one node version on a big H-P machine for $100K. Not sure how they’re doing against that target of 2,000 copies, but they have announced thirteen sales this year to smaller SAS-centric organizations, all but one outside the US.
 
While SAS has struggled to implement its in-memory software in Hadoop to date, YARN and MapReduce 2.0 will make it much easier to run non-MapReduce applications in Hadoop. Thus, it is not surprising that Hortonworks’ announcement of the SAS alliance coincides with the release of HDP 2.0, which offers production support for YARN.

[TELECOM] AAA – Authentication Authorization Accounting

What is AAA?

AAA stands for (i) Authentication (ii) Authorization and (iii) Accounting.

Authentication :

  • Refers to confirmation that a user who is requesting a service is a valid user.
  • Accomplished via the presentation of an identity and credentials.
  • Examples of credentials are passwords, one-time tokens, digital certificates, and phone numbers (calling/called).

Authorization :

  • Refers to the granting of specific types of service (including “no service”) to a user, based on their authentication.
  • May be based on restrictions, for example time-of-day restrictions, or physical location restrictions, or restrictions against multiple logins by the same user.
  • Examples of services include, but are not limited to: IP address filtering, address assignment, route assignment, encryption, QoS/differential services, bandwidth control/traffic management.

Accounting :

  • Refers to the tracking of the consumption of network resources by users.
  • Typical information that is gathered in accounting is the identity of the user, the nature of the service delivered, when the service began, and when it ended.
  • May be used for management, planning, billing etc.

An AAA server provides all the above services to its clients.
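
To make the three roles concrete, here is a toy sketch in shell. It is not how a real AAA server is built; the user table, the log file and every function name are hypothetical and exist only for illustration.

```shell
# Hypothetical user table, one "user:password:service" record per line.
# A real AAA server would use a protected credential store instead.
AAA_USERS='alice:s3cret:data
bob:hunter2:none'

# Hypothetical accounting log; a temporary file keeps the sketch self-contained.
AAA_LOG=$(mktemp)

# Authentication: succeed only if the identity and credential match a record.
authenticate() {
    printf '%s\n' "$AAA_USERS" |
        awk -F: -v u="$1" -v p="$2" '$1 == u && $2 == p { ok = 1 } END { exit !ok }'
}

# Authorization: print the service granted to the user ("none" means no service).
authorize() {
    printf '%s\n' "$AAA_USERS" | awk -F: -v u="$1" '$1 == u { print $3 }'
}

# Accounting: record who used which service, and when it began and ended.
account() {
    printf '%s,%s,%s,%s\n' "$1" "$2" "$3" "$4" >> "$AAA_LOG"
}

if authenticate alice s3cret; then
    service=$(authorize alice)
    start=$(date +%s)
    # ... the service would be delivered here ...
    account alice "$service" "$start" "$(date +%s)"
fi
```

The order matters: authorization is decided only after authentication succeeds, and the accounting record is written once the service has ended.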

To Add:
What is RADIUS?
What is DIAMETER?

[Unix] Case change (Capitalize, Lowercase, Uppercase, CamelCase) operations

Example 1:
$ cat file
Capitalize first letter of every word of a line

If you want to capitalize first letter of every word of all lines, it can be done as follows:

$ sed -i -e 's/^./\U&/g; s/ ./\U&/g' file


$ cat file
Capitalize First Letter Of Every Word Of A Line

^. – matches the character that follows the beginning of a line, i.e. the first character of the line

  . (a space followed by a dot) – matches every character that follows a space

\U – converts the matched text (&) to uppercase (\L can be used for lowercase if needed); \U and \L are GNU sed extensions
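
Because \U is specific to GNU sed, a portable alternative may be useful on other systems. The sketch below does the same word capitalization with awk. The function name capitalize_words is illustrative, and since awk rebuilds the line, runs of spaces or tabs between words collapse to a single space.

```shell
# Capitalize the first letter of every whitespace-separated word,
# reading lines from standard input (no file is modified in place).
capitalize_words() {
    awk '{
        for (i = 1; i <= NF; i++)
            $i = toupper(substr($i, 1, 1)) substr($i, 2)
        print
    }'
}

printf '%s\n' 'capitalize first letter of every word of a line' | capitalize_words
```

Reading from a pipe like this is also a safe way to try a transformation before applying it to a real file.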


Example 2:

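If you want to perform this operation on selected lines only, you can provide a line range as follows:

$ sed -i -e '1,3 {s/^./\U&/g; s/ ./\U&/g}' file

Now this operation will be performed on the first three lines only. The braces are needed because a line address applies only to the command that immediately follows it; without them, the second substitution would still run on every line.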



Example 3:
Source="ONE TWO THREE FOUR"

Target="One Two Three Four"

Solution is below:

Target=`echo "$Source" | tr 'A-Z' 'a-z' | sed -e 's/^./\U&/g; s/ ./\U&/g'`

echo "$Target"
One Two Three Four


Example 4:
To capitalize all the letters of the 1st line

sed -i -e '1 s/.*/\U&/g;' file
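
Where GNU sed is not available, the same effect can be sketched with awk's toupper. The function name upcase_lines and its two arguments (first and last line number) are illustrative; it reads standard input and does not edit a file in place.

```shell
# Uppercase only the lines whose number lies between first and last,
# leaving all other lines untouched.
upcase_lines() {
    awk -v first="$1" -v last="$2" '
        NR >= first && NR <= last { $0 = toupper($0) }
        { print }
    '
}

printf '%s\n' 'first line' 'second line' | upcase_lines 1 1
```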

[Unix] Different ways to print the next few lines after pattern match

grep, awk or a sed command is used to print the line matching a particular pattern. However, at times, we need to print a few more lines following the lines matching the pattern. In this article, we will see the different ways in which we can get this done. The first part explains ways to print one line following the pattern along with the pattern, few examples to print only the line following the pattern and in the end we have ways to print multiple lines following the pattern.

Let us consider a file with the following contents as shown below:

$ cat file
Unix
Linux
Solaris
AIX
SCO
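
To follow along, the sample file can be created like this. The scratch directory is an assumption made here so that an existing file named "file" is not overwritten.

```shell
# Create the five-line sample file in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Unix Linux Solaris AIX SCO > file
cat file
```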

1. The simplest is using the grep command. In GNU grep, there is an option -A which prints lines following the pattern.

$ grep -A1 Linux file
Linux
Solaris

In the above example, -A1 will print one line following the pattern along with the line matching the pattern. To print 2 lines after the pattern, it is -A2.

2. sed has the N command which will read the next line into the pattern space.

$ sed -n '/Linux/{N;p}' file
Linux
Solaris

First, the line containing the pattern /Linux/ is found. The commands within the braces run only on the lines that match the pattern. {N;p} means read the next line and print the pattern space, which now contains the current and the next line. Similarly, to print 2 lines, you can simply put: {N;N;p}. Examples 7 onwards show how to print multiple lines following the pattern.

3. awk has the getline command which reads the next line from the file.

$ awk '/Linux/{print;getline;print;}' file
Linux
Solaris

Once the line containing the pattern Linux is found, it is printed. getline command reads the next line into $0. Hence, the second print statement prints the next line in the file.

4. In this, the same thing is achieved using only one print statement.

$ awk '/Linux/{getline x;print $0 RS x;}' file
Linux
Solaris

getline x reads the next line into variable x. x is used in order to prevent the getline from overwriting the current line present in $0. The print statement prints the current line($0), record separator(RS) which is the newline, and the next line which is in x.
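
One caveat with getline: when the matching line is the last line of the file there is no next line to read, so getline returns 0 and leaves $0 unchanged, and the command from example 3 prints the matching line twice. A more defensive variant, sketched here with an illustrative function name, checks the return value first:

```shell
# Print each line matching the pattern plus the line after it,
# but only print the next line if getline actually read one.
match_and_next() {
    awk -v pat="$1" '
        $0 ~ pat {
            print
            if ((getline nextline) > 0)
                print nextline
        }
    ' "$2"
}

sample=$(mktemp)
printf '%s\n' Unix Linux > "$sample"
match_and_next Linux "$sample"    # the match is on the last line of the sample
```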

5. To print only the line following the pattern without the line matching the pattern:

$ sed -n '/Linux/{n;p}' file
Solaris

The n command reads the next line into the pattern space thereby overwriting the current line. On printing the pattern space using the p command, we get the next line printed.

6. Same using awk:

$ awk '/Linux/{getline;print;}' file
Solaris

Multiple lines after the pattern:
GNU grep may not be available on every Unix box. Apart from the grep option, the above examples are good only for printing a line or two following the pattern. If you have to print, say, 10 lines after the pattern, the command gets clumsy. Let us now see how to print n lines following the pattern along with the pattern:

7. To print multiple(2) lines following the pattern using awk:

$ awk '/Linux/{x=NR+2}(NR<=x){print}' file
Linux
Solaris
AIX

To print 5 lines after the pattern, simply replace the number 2 with 5. The above example is a little tricky. Once the pattern Linux is found, x is set to the current line number (NR) plus 2. Every line is then printed until the line number (NR) exceeds x.

8. To print 2 lines following the pattern without the line matching the pattern:

$ awk '/Linux/{x=NR+2;next}(NR<=x){print}' file
Solaris
AIX

The next command makes awk skip the current line, which is the line matching the pattern. In this way, we can exclude the matching line from the output.

9. To print 2 lines following the pattern along with the pattern matching line in another way:

$ x=`grep -n Linux file | cut -f1 -d:`
$ awk -v ln=$x 'NR>=ln && NR<=ln+2' file

Using grep and cut command, the line number of the pattern in the file is found. By passing the shell variable to awk, we make it print only those lines whose line number is between x and x+2.

10. One more way using sed and awk combination. First we calculate the from and to line numbers in the variables x and y. Using sed printing range of lines, we can get the same. sed can not only deal with numbers, it can work with variables as well:

$ x=`awk '/Linux/{print NR}' file`
$ y=`awk '/Linux/{print NR+2}' file`
$ sed -n "$x,$y p" file

OR

$ x=`awk '/Linux/{print NR+2}' file`
$ sed -n "/Linux/,$x p" file
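
Examples 9 and 10 assume the pattern matches exactly once; with several matches the shell variable ends up holding several line numbers. A small wrapper around the awk approach from example 7 avoids that. This is only a sketch, and the function name print_after and its argument order are illustrative. Note that awk -v interprets backslash escapes in the pattern.

```shell
# print_after PATTERN N FILE
# Print every line matching PATTERN together with the N lines after it.
print_after() {
    awk -v pat="$1" -v n="$2" '
        $0 ~ pat { last = NR + n }   # (re)start the window on every match
        NR <= last                   # print while inside the window
    ' "$3"
}

demo=$(mktemp)
printf '%s\n' Unix Linux Solaris AIX SCO > "$demo"
print_after Linux 2 "$demo"
```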

Unix – Control Structures

The control structures may work differently in different shells.

if – elif – else – fi

Bash:

if [ "$var" = "a" ] || [ "$var" = "b" ]; then
   echo "something if"

elif [ "$var" = "c" ]; then
   echo "something elif"

else
   echo "something else"
fi
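
When every branch compares the same variable against fixed strings, a case statement is a common alternative that is available in all Bourne-style shells. The sketch below mirrors the if/elif/else chain; the function name classify is illustrative.

```shell
# classify prints which branch was taken for the given value.
classify() {
    case "$1" in
        a|b) echo "something if" ;;
        c)   echo "something elif" ;;
        *)   echo "something else" ;;
    esac
}

classify b
classify z
```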


Conditional Expression:


It has been checked with the Bourne and Bash shells.

[ condition ] && A || B : if the condition is true then A, else B

[ -f /etc/hosts ] && echo "Found" || echo "Not found"
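
The shortcut form is not an exact replacement for if/else: in A && B || C, command C also runs when A succeeds but B itself fails. The sketch below contrasts the two; the function and variable names are illustrative, and false stands in for a B command that fails.

```shell
# With if/else exactly one branch runs.
check_file() {
    if [ -f "$1" ]; then
        echo "Found"
    else
        echo "Not found"
    fi
}

# With A && B || C, C also runs when A succeeds but B fails.
shortcut_result=$( [ -d / ] && false || echo "C ran although A was true" )

check_file /etc/hosts
echo "$shortcut_result"
```

With echo as command B this rarely matters, because echo almost never fails, but for anything that can fail the explicit if/else is the safer choice.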

[MySQL] – Export/ Re-create indexes on prod/live database

If you want to export indexes, simply use mysqldump. All the table definitions, including indexes, will be exported.
This kind of dump is useful if you want to import it in full. But if you have just created indexes in the dev environment and want to create the same indexes in the prod environment without recreating the whole table structure, mysqldump is of little help. In that case, use the ‘SHOW CREATE TABLE’ statement to see the indexes (KEYs) and create them in the production database.

dev mysql> SHOW CREATE TABLE city;
CREATE TABLE `city` (
  `id` smallint(4) unsigned NOT NULL auto_increment,
  `city` varchar(50) character set utf8 collate utf8_bin NOT NULL default '',
  `region_id` smallint(4) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `region_idx` (region_id),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`region_id`) REFERENCES `region` (`id`) ON UPDATE CASCADE ON DELETE RESTRICT
) ENGINE=InnoDB;

live mysql> SHOW CREATE TABLE city;
CREATE TABLE `city` (
  `id` smallint(4) unsigned NOT NULL auto_increment,
  `city` varchar(50) character set utf8 collate utf8_bin NOT NULL default '',
  `region_id` smallint(4) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB;

live mysql> ALTER TABLE `city` ADD KEY `region_idx` (region_id);
 

live mysql> ALTER TABLE `city` ADD CONSTRAINT `city_ibfk_1` FOREIGN KEY (`region_id`) REFERENCES `region` (`id`) ON UPDATE CASCADE ON DELETE RESTRICT;
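
The step from the dev SHOW CREATE TABLE output to the ALTER TABLE statements can be sketched as a small text filter. This is an unverified helper for illustration only: the function name indexes_to_alter is hypothetical, it only recognises KEY, UNIQUE KEY and CONSTRAINT lines, and it does not check which of them already exist on the live table, so review its output before running it.

```shell
# Read SHOW CREATE TABLE output on stdin and print one ALTER TABLE
# statement per secondary key or constraint (PRIMARY KEY is skipped).
indexes_to_alter() {
    awk -v table="$1" '
        /^ *(UNIQUE KEY|KEY|CONSTRAINT) / {
            sub(/^ +/, "")     # drop leading indentation
            sub(/,? *$/, "")   # drop the trailing comma, if any
            printf "ALTER TABLE `%s` ADD %s;\n", table, $0
        }
    '
}

indexes_to_alter city <<'EOF'
CREATE TABLE `city` (
  `id` smallint(4) unsigned NOT NULL auto_increment,
  PRIMARY KEY  (`id`),
  KEY `region_idx` (region_id),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`region_id`) REFERENCES `region` (`id`) ON UPDATE CASCADE ON DELETE RESTRICT
) ENGINE=InnoDB;
EOF
```

For the dev definition shown earlier, this prints the same two statements that were run against the live database.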

Also note that creating these indexes on the prod database may take a long time, depending on various factors such as table size.

The mysqldump utility is also useful if performance isn’t your main concern. If you’re looking for a very fast method, though, I would suggest copying the actual InnoDB files from one cold database to the other (assuming they run exactly the same MySQL version with exactly the same configuration and exactly the same expected behavior, etc). This method is dangerous if there are any differences between the systems. You might want to try copying your good data to your testing environment first, since the results may differ from one system to another.
[NOTE]: ARTICLE UNDER CONSTRUCTION.  THE INFO PROVIDED ABOVE HAS NOT BEEN VERIFIED YET.

Visit your website before actual DNS propagation

There are times when you may wish to test a website you have designed, before the actual domain name’s DNS (Domain Name System) entry is updated (or DNS propagation is done).
After changing the nameservers with your registrar, you can check the status of DNS propagation on www.whatsmydns.net

 DNS Explained

Think of yourself placing a phone call to the 411 Operator. You ask the operator “What’s the phone number of Joe’s Pizza in Paramus, New Jersey?” The Operator looks through all the company listings in that area and finds the listing for Joe’s Pizza. The Operator replies “The phone number for Joe’s Pizza in Paramus, New Jersey is 201.983.7564. I will connect you now…” and the next thing you know you are speaking with Joe’s Pizza, ordering your dinner.
The Internet works the same way for the most part. Just as every company has its own unique phone number, every web site on the Internet has a specific numeric address assigned to it, known as an IP address or “Internet Protocol Address”. Most people do not realize that Internet servers can only be addressed via IP addresses. However, it would be impossible to remember that the IP Address for Joe’s Pizza is 201.983.756.4 (a made-up address that mirrors the phone number above; each part of a real IPv4 address is at most 255), so the Domain Name System was created.
Host names are the web site addresses you see every day: http://www.google.com, www.joespizza.com, and so on. We use these “words” like http://www.google.com so humans don’t need to remember the long IP address numbers when they want to visit a web site.
Now let’s use the first example of placing a traditional phone call in the context of visiting a web site using your computer’s browser to explain how web addresses work.
You happen to know that the web site address for your favorite pizza joint is http://www.joespizza.com (because that’s a lot easier to remember than some four part numerical value). You type “joespizza.com” into your web browser. When you hit the Enter key, your request is sent to the “Internet 411 Operator” used by your Internet Service Provider company (or ISP for short). In the world of the Internet, this “411” Operator is known as a Domain Name Server. This Domain Name Server (DNS server) looks through all its domain name entries for Joe’s Pizza. The DNS Server is thinking “The web site JOESPIZZA in the .COM domain is being hosted by a web server with the IP Address of 201.983.756.4” and forwards you to that location, and before you know it, you are looking at the web site for Joe’s Pizza.

What is the Hosts File On My Computer?

Simply put, the Hosts file is similar to an address book. Exactly like the example above, when you type an address like http://www.joespizza.com into your web browser, the Hosts file on your own computer is referenced to see if you have the IP address or “telephone number” for that web site. If you do, then your computer will use that number it has on file locally to “call” and the corresponding web site opens. If not, your computer will ask the DNS Server belonging to your Internet Service Provider for the associated IP address for the corresponding web site and connects you to that web site. The majority of the time, you will not have all the IP Addresses of all the web sites from the entire Internet in your “address book”. You will probably have very few (if any) entries within your Local Hosts file. Therefore, most of the time your computer will ask for the IP addresses of web sites you wish to visit from your ISP.
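
The "address book first, operator second" order can be illustrated with a small shell sketch that searches a hosts-format file for a name. It is a simplification of what the system resolver does; the function name lookup_in_hosts and the demo file are illustrative, and the comparison is case-sensitive here.

```shell
# lookup_in_hosts NAME FILE
# Print the first IP address mapped to NAME in a hosts-format FILE
# and return success; return failure if NAME is not listed.
lookup_in_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # ignore comments
        {
            for (i = 2; i <= NF; i++)      # field 1 is the IP, the rest are names
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "$2"
}

# Demo on a throwaway file rather than the real hosts file.
demo_hosts=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '131.34.23.5 www.joespizza.com' > "$demo_hosts"

if ip=$(lookup_in_hosts www.joespizza.com "$demo_hosts"); then
    echo "hosts file answers: $ip"
else
    echo "not in hosts file, the computer would ask the DNS server"
fi
```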

Why Would I Want to Edit The Host Files On My Computer?

Sometimes when designing a new web site, you may need to test certain aspects of the site before launching the site live to the general public. By editing your local Hosts file, you can affect what happens when you type in a certain web site address on your own system by redirecting the web browsers on your computer to a different IP address to view that particular site than the rest of the world would see. So while the rest of the users of the Internet type “www.joespizza.com” into their web browsers and get redirected to the web server at the IP address 201.983.756.4, visiting the same “www.joespizza.com” web address on YOUR own computer only could bring you to the IP address of say 131.34.23.5 (your testing web server, for example). Once you are done testing your site, you could then edit your local Host file again to connect to the site as a regular Internet user would.
 
To test your web site using your own domain name BEFORE DNS propagation has completed, you can use your local computer’s HOSTS file. Your computer will use the entries in your HOSTS file FIRST, before it tries to use your ISP to look up the DNS information for your domain.

REMEMBER: When you are finished testing, remember to remove the custom lines that you added to your Hosts file.

How to Edit Your Hosts File on a Windows PC (Windows 95/98/Me/2000/XP/2003/Vista/7)

Let us assume for this example your testing server has an IP address 88.46.57.157 and you wish to visit that server when you type “http://example.com” into a web browser BUT still wish to see the site as the rest of the World Wide Web does when you enter “http://www.example.com” into your browser.
  1. Locate the HOSTS file on your computer. Typically it is in one of the following locations:
    • Windows NT/2000/XP/2003/Vista/7 – C:\windows\system32\drivers\etc\hosts
    • Windows 95/98/Me – C:\windows\hosts
  2. Open this file with a text editor such as Notepad or Wordpad.
    • Right-click on Notepad and select the option to Run as Administrator, otherwise you may not be able to open this file. (In Windows 7, press and hold Ctrl+Shift while opening Notepad/Wordpad.) Then, open the file. Consider performing a “Save As” so you have an original copy of the file that you can restore later. You will see two columns of information, the first containing IP addresses and the second containing host names. By default, a Windows hosts file should be similar to the following:

      Filename: hosts
      127.0.0.1 localhost


      You can add additional lines to this file that will point requests for a particular domain to your new server’s IP address.
      Example:


      Filename: hosts
      127.0.0.1 localhost
      88.46.57.157 example.com


  3. Save your changes (be sure the file is still named hosts, with no .txt extension).
  4. Restart any currently open browsers. You may also want to flush your DNS cache. In Windows XP, go to Start, and then Run, then type “cmd” and hit enter.
    Type the following: ipconfig /flushdns
  5. In your web browser you should see your site as it appears on your testing server when typing http://example.com/ but still be able to see the site on its current web server by visiting http://www.example.com/

How to Edit Your Hosts File on an Apple Macintosh Using Mac OSX

Let us assume for this example your testing server has an IP address 88.46.57.157 and you wish to visit that server when you type “http://example.com” into a web browser BUT still wish to see the site as the rest of the World Wide Web does when you enter “http://www.example.com” into your browser.
  1. Open Terminal, which is in Applications, then the Utilities folder.
  2. You may want to first make a backup copy of your existing hosts file:
    sudo cp /private/etc/hosts /private/etc/hosts-orig

    Enter your user password at the prompt. Then type the following command to edit your hosts file:

    sudo nano /private/etc/hosts

    Enter your user password at the prompt if asked.

  3. You will see a file with contents similar to the following:


    Filename: hosts

    ##
    # Host Database
    #
    # localhost is used to configure the loopback interface
    # when the system is booting. Do not change this entry.
    ##
    127.0.0.1 localhost
    255.255.255.255 broadcasthost
    ::1 localhost
    fe80::1%lo0 localhost

    Using the arrow keys on your keyboard, navigate around this file and add your domain and IP address to the bottom of the file. For example:



    Filename: hosts

    ##
    # Host Database
    #
    # localhost is used to configure the loopback interface
    # when the system is booting. Do not change this entry.
    ##
    127.0.0.1 localhost
    255.255.255.255 broadcasthost
    ::1 localhost
    fe80::1%lo0 localhost
    88.46.57.157 example.com

  4. When done editing the hosts file, press the keyboard combination Control+O to save the file.
    Then press Enter at the filename prompt to confirm the Save operation. Finally press the keyboard combination Control+X to exit the editor. You may also need to grant yourself sudo privileges, if you got a permission error in Step 2. In your “Help” menu, search for “root” and select the instructions for “Enabling the root user.” Follow those instructions.
  5. Restart any currently open browsers. You may also want to flush your DNS cache.
    Type the following command into your Terminal window: dscacheutil -flushcache
  6. In your web browser you should see your site as it appears on your testing server when typing http://example.com/ but still be able to see the site on its current web server by visiting http://www.example.com/

How to Edit Your Hosts File on Ubuntu/ Unix:

On Unix and Unix-like OS, the host file is located at /etc/hosts (for many versions).
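
On such systems, adding the test entry and removing it again afterwards can be scripted. The following is a minimal sketch, not a vetted tool: the function names are illustrative, the demo works on a scratch copy, and for the real file you would pass /etc/hosts and run with root privileges.

```shell
# add_host_entry IP NAME FILE: back up FILE, then append the mapping.
add_host_entry() {
    cp "$3" "$3.orig" &&
        printf '%s %s\n' "$1" "$2" >> "$3"
}

# remove_host_entry NAME FILE: drop lines whose second column is exactly NAME.
# Writing back with cat keeps the permissions of the original file.
remove_host_entry() {
    tmp=$(mktemp) &&
        awk -v name="$1" '$2 != name' "$2" > "$tmp" &&
        cat "$tmp" > "$2" &&
        rm -f "$tmp"
}

# Demo on a scratch copy instead of the real /etc/hosts.
hosts_copy=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' > "$hosts_copy"
add_host_entry 88.46.57.157 example.com "$hosts_copy"
cat "$hosts_copy"
remove_host_entry example.com "$hosts_copy"
```

On many Linux systems, running getent hosts example.com after the edit should show which address the system resolver now returns.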