Thursday, February 9, 2012

a general query and db question

Hello,

I have a couple of general questions that don't require help figuring out as much as help pointing me in the right direction.

I have a table that has 3 columns. it basically contains links to outside urls that contains these columns

URL ID description.

Now I was thining that since each url is different and that I want a easy way to check that it is never entered twice, I would make the URL field the primary. The ID field is generic in nature and will be auto incremented. It is only there becuse it will have a use later on.

Description is just a description.

Anyway, my concern is search time. I assume that once this table gets rather large, it will take a lot longer to search through.

So what I am asking is this:

What should I be looking at here to make sure that this searches fast?

is there maybe a better way to design this to allow for quicker searching. Maybe breaking the links up into categories and have a category table to search from?

Is using the url a good idea for a primary?

Just general information pointers would be helpful. Links, ideas etc.

Thanks

Is using the url a good idea for a primary?

yes. . . ABSOLUTLY!!! it is the piece of data that by its nature must be unique. The id, set to an Identity(int ,1,1) is what is referred to as a surrogate key and would be used as a foreign key in other tables, such as your catagory table and should be set as a Unique Index. While it is unique, it has no real meaning in your data, therefore it is not the primary key.

|||

Hi,

I have to disagree with that answer.

The URL might change for some reason, it's living data.

Use the identity key as your primarykey. Put an unique index on the url. All foreign keys from related tables points at your primarykey ie your identity key.

That makes searches faste since you use an in t och bigint as your PK.

Regards

|||

JMattias wrote:

Hi,

I have to disagree with that answer.

The URL might change for some reason, it's living data.

Use the identity key as your primarykey. Put an unique index on the url. All foreign keys from related tables points at your primary key ie your identity key.

That makes searches faste since you use an in t och bigint as your PK.

Regards

This is a matter of style.

To me, a primary key is the tuple of data that identifies the natural uniqueness of the data within the domain of the data. A unique identifying big int has no relevance to URLs. It is simply a "Surrogate Key" that would be used to simplify the data model. Yes add the unique Identifier and put the surrogate key in the child tables, but it is not and should not be, the primary key.

Some people believe as you do, and it goes against everything I learned in my database theory courses.

Speciufying the URL as a primary key as opposed to the unique int ID has no bearing on speed.

Primary keys are unique indexes with the constraint that there can only be one on the table.|||

You know, I am looking over the googles on Surrogate Key and there really is alot of bad information out there.

Most create primary keys based on surrogate key and then set a unique id on the natural key as you have suggested.

This is not correct - by definition, A primary key is the tuple that by the combination of its qualities determines the integrity of the entity. thats it. . . nothing more.

If your model finds that the data in the primary key may change over time, that calls for a surrogate key. A surrogate key by definition is a piece of data that can stand in for the primary key when needed. It does not replace the primary key.

Physically, it makes no difference on performance - Consider at the following

My way:

Create table foo(ID int unique not null, Data varchar(255) not null primary key)

Create table bar(FooID int references Foo(ID), FooData varchar(255), primary key (FooID, FooData))

your way:

Create table foo(ID int unique not null primary key, Data varchar(255) not null unique)

Create table bar(FooID int references Foo(ID), FooData varchar(255), primary key (FooID, FooData))

both work, both perform the same. But. . .

In mine, the Primary Key decoration on Foo.Data properly models the logical architecture that "Data" is the relative tuple of data that determines the Foo domain of data and ID is a unique item of a secondary nature that has no meaning in the domain.

In yours, the Primary Key decoration on Foo.ID implies that ID is the relative tuple that determines the Foo domain of data and that the uniqueness of Data is a secondary consideration. This is incorrect. ID does not come from the domain of Foo. It is an arbitrary number that has no meaning. Therefore by definiton it is not the primary key.

I remember this discussion vividly from my college database courses.

Just because everybody does it doesn't make it right. . . look at all the people using VBSUX to develop database applications.

|||

Hi,

You're right that it will work the same way. But I personally would not reference a foreignkey to an non key field. It's unique that's for sure, but I wouldn't. It's a matter of theory as you said.

Regards

|||Thanks guys, the information has been very helpful|||

JMattias wrote:

You're right that it will work the same way. But I personally would not reference a foreignkey to an non key field. It's unique that's for sure, but I wouldn't. It's a matter of theory as you said.

So I was reading, as I often do, and I came across something that may be of concern if you put the primary key in the surrogate -

if you execute this sql:

create table Test1(SurrogateKey int identity not null primary key, NaturalKey varchar(255) not null unique)

And then generate the script from the Enterprise manager/ Management Studio, you get the following table definition:


CREATE TABLE [dbo].[Test1](
[SurrogateKey] [int] IDENTITY(1,1) NOT NULL,
[NaturalKey] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
PRIMARY KEY CLUSTERED
(
[SurrogateKey] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY],
UNIQUE NONCLUSTERED
(
[NaturalKey] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

Notice, SurrogateKey is a clustered index. However, a application would never directly select on SurrogateKey and any selection via a foreign key relationship will not be sequential.

So, I'm thinking, if you wanted the SurrogateKey to be the primary key, this would be more optimal -

create table Test1(SurrogateKey int identity not null primary key, NaturalKey varchar(255) not null unique clustered)

|||

For all intents and purposes, I had decided to use the URL as the primary.

The reason being two fold:

First, if a link is dead, it will be removed. If it has been changed, there is something in place for that already. I have a side program in mind that will check links randomly using a separate db. When changes are made a new entry is made to the active db to cut down on too much db use.

Second, with having the URL as a primary it was easier to check that someone wasn't entering the URL twice. Doing it that way was easier considering some of the other stuff involved.

Last, I don't see the point of putting an index on the URL. Maybe I don't have the reasons for putting indexes on columns fully understood but I was under the impression that you would index the columns that would be most searched and the URL would never be searched or in any sequential order that would be relevant to a search..

If I am wrong, please tell me

|||

Hi,

An index is not only for your searches in the db. It's also used by the query optimizer when calculating costs etc. So if you join on the primarykey the index might be used fetching your data.

About indexes: Should bee applied to the columns that are referenced via joins, that are searched on, and sometimes you can use covering indexes ie indexes covering all the columns in your select.

Regards

No comments:

Post a Comment